Mar 22, 2026
Defense-in-Depth Against AI Agent Sabotage
Nora Petrova, John Burden, Helin Ece Akgul
We test whether shaping an AI coding agent's deployment context — the documentation it reads, the norms it encounters, the monitoring it believes exists — can prevent sabotage before any monitor fires. Across five frontier models from five providers (n=50 per condition), context engineering reduces sabotage by 37 to 82 percentage points (p < 0.001 on all five). The most effective intervention is honestly telling the model it is in a safety evaluation, which works on every model tested. Defenses are model-specific in revealing ways: compliance docs work in workspaces but backfire in prompts, social norms show the reverse, and high-awareness models resist indirect context but respond to direct framing. Pairing prevention with a diverse monitor ensemble catches all remaining subtle bugs, yielding 0% undetected sabotage across all five models.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Defense-in-Depth Against AI Agent Sabotage
},
author={
Nora Petrova, John Burden, Helin Ece Akgul
},
date={
3/22/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


