Mar 23, 2026

SAFE: Sandboxed Actions Forensics and Evaluation for AI Control via Action-Aware Monitoring

Nathan Khosla, Jessie Anderson, Aleks Wosztyl, Jakub Halmeš

SAFE (Sandboxed Action Forensics and Evaluation) is an AI control framework that improves the monitoring of untrusted agentic models by executing their proposed actions in a forked sandbox twin of the production environment and feeding the resulting activity logs—filesystem diffs, network attempts, privilege escalations, and more—to a trusted blue-team reviewer. Across 30 models spanning multiple frontier labs, SAFE's action-aware analysis substantially increased malicious action detection rates compared to text-only review of the agent's proposed commands and reasoning. Notably, the benefit was not correlated with model size or intelligence benchmarks: small, cheap models like Ministral 3B achieved perfect detection with SAFE context, while some large frontier models failed even with it, suggesting that blue-team model selection is a significant and underexplored design variable in AI control systems.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

Using a fixed attack trajectory is a good choice for monitoring projects like this, not much of a limitation. It would have been nice to use several different fixed trajectories rather than just one, though.

I felt like the results section had a lot of information, and none of it was the key thing I was looking for. I want to see the key result: uplift in monitor classification performance by adding this extra context from the sandbox. The intelligence index, temperature experiments, cost data, and testing so many models distracts from the key result, and the results shown don't quite get to classification performance as the key metric.

This entry proposes an interesting way to improve live monitor capabilities by providing the blue-team with a simulated outcome of the proposed red-team action at each step through running it in a sandbox. This is a promising idea, and one could imagine this technique to be used in some future monitoring setups.

The results on monitor performance are (as one may expect) convincing and only somewhat held back by the small sample size. It would have been helpful to explain the scoring methodology more - as it is, a reader may reasonably worry that the surprising strength of mistral models may be due to those generally assigning high scores even to honest trajectories. This seems to not be the case, but should have been checked and could have been explained better.

One direction that would have been worth exploring more is how far this protocol could be extended beyond simple bash settings, e.g. to environments where network communication, database access, or running expensive computations are necessary. Network access is mentioned in-passing but a sketch of possible extensions in these directions could be a very interesting addition.

Some of the exposition could be sharpened: For example, temperature is mentioned several times, without being clearly related to the experimental setup, and it was not clear whether or not chain-of-thought was provided to the monitor in any of the protocols (this would arguably be an unfair blue-team advantage).

Cite this work

@misc {

title={

(HckPrj) SAFE: Sandboxed Actions Forensics and Evaluation for AI Control via Action-Aware Monitoring

},

author={

Nathan Khosla, Jessie Anderson, Aleks Wosztyl, Jakub Halmeš

},

date={

3/23/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.