Mar 23, 2026
Performative Subversion: Who Monitors the Monitor?
Max Pinelo, Angel Tenorio, Fernando Avalos, Valeria Mora, Sebastian Urrea
Every AI control eval assumes clean code. Real codebases aren't clean: 87% carry vulnerabilities (Black Duck, 2026). When you ask an AI to extend vulnerable code, writing code that fits its surroundings means perpetuating the vulnerability. No malicious step. No anomaly to detect. We call this performative subversion.
We ran 1,800 trials. Results: 74–81% perpetuation across all framings — the codebase drives everything, prompting barely matters. The scariest finding: transitive perpetuation — the AI calls into vulnerable functions without ever copying insecure code. A scanner sees nothing wrong. The standard monitor? 97% false positive rate — useless. Give it codebase context? FPR drops to 5%, detection stays at 100%. But vulnerabilities still persist at 78%. The only real fix: clean the code before the AI touches it.
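To make transitive perpetuation concrete, here is a minimal sketch (hypothetical names, not the paper's actual environment): a pre-existing helper with a SQL injection flaw, and a new function that extends it without copying a single insecure line. A scanner reviewing only the new code sees nothing wrong, yet the vulnerability remains reachable.

```python
import sqlite3

# Hypothetical pre-existing helper: vulnerable, because it
# interpolates user input directly into the query string.
def get_user(conn, username):
    cur = conn.execute(f"SELECT name FROM users WHERE name = '{username}'")
    return cur.fetchall()

# New, AI-written extension: every line here is "clean" on its own.
# It never copies the insecure pattern, it only calls the existing
# helper, so a diff-level scanner flags nothing.
def greet_user(conn, username):
    rows = get_user(conn, username)
    return [f"Hello, {name}!" for (name,) in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

# Benign input behaves normally...
print(greet_user(conn, "alice"))
# ...but attacker-controlled input flows through the new code into
# the old injection point: this payload matches every row.
print(greet_user(conn, "' OR '1'='1"))
```

The fix the abstract argues for applies here too: parameterize `get_user` before the AI extends it, rather than asking a monitor to flag `greet_user`, which contains no anomaly.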
The cleanroom assumption is wrong. 1,800 trials prove it.
This paper identifies what seems to be a fairly novel and overlooked issue with current control protocols, one that substantially increases the difficulty of implementing them in production.
This paper touches on the difficulty of action versus inaction monitoring, which is one of the harder problems in AI control. I'd encourage the authors to look at work within the AI sandbagging literature, which has covered the inaction difficulty in more depth.
At points the paper was difficult to follow; it would benefit from being more concise and from laying out the definitions of the different environments more clearly. Additionally, some findings were confusing and apparently absent from the main figures. For instance, "The one defense that works: blue-team monitoring. FPR drops from 97% to 5% (19× improvement) while maintaining 100% detection." seems like a relevant result that should appear in the main text, and it is difficult to reconcile with the rest of the work.
Unfortunately, it is hard not to notice the LLM writing trail in the text. While that is not bad on its own, I worry whether the results and data were diligently verified. That aside, the write-up is well implemented and the execution quality is good, but I don't understand the potential impact of these findings. Yes, real-world LLMs will operate outside the pristine environment of AI control benchmarks, but if ordered to find and fix vulnerabilities they will, without a doubt, do so. And if they assume hard-coding credentials is fine because the production environment does so, why is that a problem? Within AI control, we are interested in whether models can evade monitors when they try, not in whether vulnerabilities pass through when no one attempts to catch them. But overall, fine work!
Cite this work
@misc{pinelo2026performative,
  title={(HckPrj) Performative Subversion: Who Monitors the Monitor?},
  author={Max Pinelo and Angel Tenorio and Fernando Avalos and Valeria Mora and Sebastian Urrea},
  year={2026},
  month={March},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


