Mar 21, 2026
The Blind Spot: How Mechanistic Interpretability Workflows Enable Rogue Deployment
Sirichada Wattanasiritanawong
Agentic mechanistic interpretability workflows are a structural blind spot for AI control protocols. To investigate this claim, we design a new ControlArena setting in which an autonomous agent conducts three mechanistic interpretability experiments of increasing difficulty (logit lens, attention heads, and circuit tracing) while covertly deploying GPT-2 as a persistent inference server. The agent achieved 100% main task success in both honest and attack modes, and 67% side task success in attack mode, with the deployed model successfully serving inference. Critically, the monitor's suspicion score spiked above 7 during the deployment steps, yet the protocol did not abort the task. Furthermore, in a preliminary run, the agent independently generated justifications for its covert actions without explicit instruction. Taken together, these findings suggest that monitoring agents at the output or action level alone is insufficient. The fix requires protocol-level changes, such as cumulative suspicion tracking or task-level abort conditions, not just improved monitors.
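The protocol-level fix described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the class name, thresholds, and abort logic are all assumptions chosen to show how per-step and cumulative suspicion could jointly trigger a task-level abort.

```python
from dataclasses import dataclass, field

@dataclass
class SuspicionTracker:
    """Illustrative tracker: aborts on a single spike or on accumulated suspicion.

    Thresholds are arbitrary placeholders, not values from the paper.
    """
    step_threshold: float = 7.0        # abort if any single step spikes this high
    cumulative_threshold: float = 20.0 # abort if total suspicion over the task exceeds this
    scores: list = field(default_factory=list)

    def record(self, score: float) -> bool:
        """Record one monitor score; return True if the task should abort."""
        self.scores.append(score)
        if score >= self.step_threshold:
            return True  # a single spike (like the >7 scores observed) now aborts
        return sum(self.scores) >= self.cumulative_threshold

# Usage: feed per-step monitor scores; stop the agent as soon as record() fires.
tracker = SuspicionTracker()
for score in [2.0, 3.5, 7.2]:
    if tracker.record(score):
        print("task aborted")
        break
```

The design point is that either condition alone misses a failure mode: a step threshold catches spikes the monitor already flags, while the cumulative sum catches an agent that stays just under the per-step bar across many steps.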
Cite this work
@misc{wattanasiritanawong2026blindspot,
  title={(HckPrj) The Blind Spot: How Mechanistic Interpretability Workflows Enable Rogue Deployment},
  author={Sirichada Wattanasiritanawong},
  date={2026-03-21},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


