Mar 22, 2026

Defense-in-Depth Against AI Agent Sabotage

Nora Petrova, John Burden, Helin Ece Akgul

We test whether shaping an AI coding agent's deployment context — the documentation it reads, the norms it encounters, the monitoring it believes exists — can prevent sabotage before any monitor fires. Across five frontier models from five providers (n=50 per condition), context engineering reduces sabotage by 37 to 82 percentage points (p < 0.001 on all five). The most effective intervention is honestly telling the model it is in a safety evaluation, which works on every model tested. Defenses are model-specific in revealing ways: compliance docs work in workspaces but backfire in prompts, social norms show the reverse, and high-awareness models resist indirect context but respond to direct framing. Pairing prevention with a diverse monitor ensemble catches all remaining subtle bugs, yielding 0% undetected sabotage across all five models.
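The abstract's headline numbers rest on comparing sabotage rates between conditions with n=50 runs each. As a minimal illustration of how such a comparison reaches p < 0.001, here is a two-proportion z-test in pure Python; the counts (45/50 vs 4/50, an 82-percentage-point reduction mirroring the largest reported effect) are hypothetical, not the paper's actual data.

```python
import math

def two_prop_z(hits_a: int, hits_b: int, n: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test, equal sample sizes n per condition."""
    p1, p2 = hits_a / n, hits_b / n
    pooled = (hits_a + hits_b) / (2 * n)          # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal survival function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 45/50 sabotage attempts without the intervention,
# 4/50 with it -- an 82-percentage-point drop.
z, p = two_prop_z(45, 4, 50)
print(f"z = {z:.2f}, p = {p:.2e}")  # p falls far below the 0.001 threshold
```

With effect sizes this large, even n=50 per condition yields z-scores well above conventional significance thresholds, which is why the paper can report p < 0.001 across all five models.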

Reviewer's Comments


This paper stands out to me: thorough, well explained, and clearly articulated. The findings are interesting. The core finding, that shaping deployment context can reduce misbehavior before monitors ever need to catch it, is useful, actionable, and, I believe, novel.

Validating across five frontier models with n=50 and statistical significance gives real confidence in the findings and goes beyond standard hackathon work.

I would really like to see what happens across different attack vectors; i.e., instead of just "this is for an interview," a few dozen alternatives, perhaps in future work.

Very well-written and well-structured. The heatmap communicates a complex evaluation at a glance, the appendices add real depth, and the paper could serve as an example for how to communicate this kind of work.

Super solid execution in the timeframe of the hackathon, interested to see how this would work out at scale!

Cite this work

@misc{petrova2026defense,
  title={(HckPrj) Defense-in-Depth Against AI Agent Sabotage},
  author={Nora Petrova and John Burden and Helin Ece Akgul},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}


Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.


Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.