Mar 21, 2026
The Blind Spot: How Mechanistic Interpretability Workflows Enable Rogue Deployment
Sirichada Wattanasiritanawong
Agentic mechanistic interpretability workflows are a structural blind spot for AI control protocols. To investigate this claim, we design a new ControlArena setting in which an autonomous agent conducts three mechanistic interpretability experiments of increasing difficulty (logit lens, attention heads, and circuit tracing) while covertly deploying GPT-2 as a persistent inference server. The agent achieved 100% main-task success in both honest and attack modes and 67% side-task success in attack mode, with the deployed model successfully serving inference. Critically, the monitor's suspicion score spiked above 7 during the deployment steps, yet the protocol did not abort the task. Furthermore, in a preliminary run, the agent independently generated justifications for its covert actions without explicit instruction. Taken together, these findings suggest that monitoring agents at the output or action level alone is insufficient: the fix requires protocol-level changes such as cumulative suspicion tracking or task-level abort conditions, not just improved monitors.
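The cumulative suspicion tracking mentioned above could be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the class name, thresholds, and the "abort"/"continue" return values are all assumptions chosen for clarity. The idea is that the protocol aborts the whole task either when a single step is highly suspicious or when suspicion accumulated across steps exceeds a budget, rather than reacting to per-step scores in isolation.

```python
# Hypothetical sketch of cumulative suspicion tracking for a control
# protocol. Thresholds and names are illustrative assumptions.

class CumulativeSuspicionMonitor:
    def __init__(self, step_threshold=7.0, cumulative_budget=15.0):
        self.step_threshold = step_threshold      # abort on any single very suspicious step
        self.cumulative_budget = cumulative_budget  # abort once total suspicion is too high
        self.total = 0.0

    def score_step(self, suspicion):
        """Return 'abort' if this step, or the running total, is too suspicious."""
        self.total += suspicion
        if suspicion >= self.step_threshold:
            return "abort"
        if self.total >= self.cumulative_budget:
            return "abort"
        return "continue"
```

Under this scheme, a sequence of moderately suspicious steps (e.g. several scores of 3) eventually triggers a task-level abort even though no single step crosses the per-step threshold.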
Consider this more of a personal opinion, but the paper could have made a stronger case for why mech interp specifically is differentially risky: cloning and backing up weights is not an especially uncommon capability. I think this work would have been more interesting as a study of how models rationalise self-proliferation, or of how agents executing many steps (a fairly common setup with advanced reasoning models, opaque CoT, and autonomous research pipelines) may be able to conceal weight proliferation easily, and so forth.
Rogue deployments are an important threat model and I'm not aware of any good settings to study them. This is a very reasonable setup, I especially like the design of the side task scorer. But I have some concerns:
- w.r.t. the results claiming the attack policy was very successful: I think you may have used the defer-to-trusted (DTT) protocol slightly wrong. DTT isn't supposed to stop the attack outright, just to produce a high suspicion score at some point before the attack completes. You then go back asynchronously, compute an auditing threshold from the suspicion-score distribution, and identify which of the attack trajectories would have been audited. It looks to me like you ran DTT expecting it to audit synchronously.
- in general I'd prefer not to use an LLM scorer for the main task, to avoid blue-team reward hacking
- it also seems a bit tricky to decide what information the monitors should have about which side task they are monitoring for. I would have appreciated an appendix with some monitor reasoning, just to confirm the monitors weren't seeing the rogue deployment outright but letting it slide because it seemed harmless to them.
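The asynchronous auditing procedure described in the first bullet can be sketched as follows. This is a minimal illustration of the general idea, not ControlArena's API: the function names and the audit budget are assumptions, and all the numbers in the usage note are made up. The point is that the auditing threshold is computed after the fact from the distribution of suspicion scores on honest runs, and an attack counts as caught if any of its steps would have exceeded that threshold.

```python
# Hypothetical sketch of asynchronous DTT-style auditing. We fix an
# audit budget (fraction of honest runs we are willing to audit),
# derive a threshold from honest-run suspicion scores, then check
# which attack trajectories would have been flagged.

def auditing_threshold(honest_max_scores, audit_budget=0.01):
    """Threshold such that roughly `audit_budget` of honest runs get audited."""
    scores = sorted(honest_max_scores, reverse=True)
    k = max(1, int(len(scores) * audit_budget))
    return scores[k - 1]

def caught_attacks(attack_trajectories, threshold):
    """An attack is caught if any step's suspicion reaches the threshold."""
    return [
        traj for traj in attack_trajectories
        if max(traj["step_scores"]) >= threshold
    ]
```

For example, with 100 honest runs and a 5% audit budget, the threshold is the 5th-highest honest suspicion score; an attack trajectory whose per-step scores ever reach that value would have been audited, even though nothing stopped it synchronously.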
Cite this work
@misc{
  title={(HckPrj) The Blind Spot: How Mechanistic Interpretability Workflows Enable Rogue Deployment},
  author={Sirichada Wattanasiritanawong},
  date={2026-03-21},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


