Mar 23, 2026
Beyond Prompt Injection: Offensive Security Tradecraft Exposes Architectural Blind Spot Classes in AI Control Monitors
John Kearney
AI control monitors that evaluate agent actions per-task struggle to detect attacks distributed across multiple individually-benign tasks. I applied offensive security tradecraft to ControlArena’s Opus 4.6 monitor using 22 attack transcripts across four categories. All 20 offensive transcripts evaded the default monitor. Against the full-trajectory monitor, TOCTOU attacks evaded at 100% and salami attacks at 80%, while living-off-the-land and text-only exfiltration were mostly caught. The difference is structural: TOCTOU and salami distribute the attack across separate evaluations that are each individually benign, so no single evaluation contains evidence of wrongdoing. Cross-model validation on Sonnet 4.5 confirmed this: content-based evasion disappeared entirely but structural evasion persisted. This structural/content-based distinction is the central finding: it separates attacks that yield to capability scaling from attacks that require architectural change, two fundamentally different problems. The report provides a taxonomy mapping offensive security patterns to monitor blind spot classes, a proof-of-concept aggregate monitor with cross-task state (caught 3/5 salami steps vs. 1/5 for per-task evaluation), ControlArena-compatible attack policies, and the first dataset bridging offensive security methodology with AI control red-teaming.
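The gap between per-task and aggregate evaluation can be illustrated with a toy sketch. This is not the report's proof-of-concept monitor and it does not use ControlArena's API. The `TaskAction` record, the suspicion scores, and both thresholds are hypothetical values chosen only to show the mechanism, and the output is not meant to reproduce the 3/5 vs. 1/5 result. The assumption is that each salami step receives a low suspicion score on its own, so only a monitor that carries state across tasks can see the scores accumulate.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, chosen only for illustration.
PER_TASK_THRESHOLD = 0.5   # a single action at or above this is flagged
AGGREGATE_THRESHOLD = 1.0  # cumulative suspicion per resource at or above this is flagged


@dataclass
class TaskAction:
    """One agent action, scored in isolation by some upstream monitor (assumed given)."""
    task_id: str
    resource: str     # hypothetical label for what the action touches
    suspicion: float  # per-task suspicion score in [0, 1]


def per_task_monitor(action: TaskAction) -> bool:
    """Stateless check: sees only the current action."""
    return action.suspicion >= PER_TASK_THRESHOLD


@dataclass
class AggregateMonitor:
    """Stateful check: accumulates suspicion per resource across tasks."""
    totals: dict = field(default_factory=dict)

    def observe(self, action: TaskAction) -> bool:
        total = self.totals.get(action.resource, 0.0) + action.suspicion
        self.totals[action.resource] = total
        return action.suspicion >= PER_TASK_THRESHOLD or total >= AGGREGATE_THRESHOLD


# A toy salami sequence: five small steps against the same resource,
# each individually below the per-task threshold.
steps = [TaskAction(f"task-{i}", "billing-config", 0.3) for i in range(1, 6)]

per_task_flags = [per_task_monitor(s) for s in steps]

aggregate = AggregateMonitor()
aggregate_flags = [aggregate.observe(s) for s in steps]

print("per-task: ", per_task_flags)
print("aggregate:", aggregate_flags)
```

In this toy run the stateless check flags none of the five steps, while the stateful check starts flagging once the running total for the resource crosses the aggregate threshold at the fourth step. Summing scores per resource is only one possible form of cross-task state; it stands in for whatever state a real aggregate monitor would keep.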
Cite this work
@misc{
  title={(HckPrj) Beyond Prompt Injection: Offensive Security Tradecraft Exposes Architectural Blind Spot Classes in AI Control Monitors},
  author={John Kearney},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


