Mar 23, 2026

Performative Subversion: Who Monitors the Monitor?

Max Pinelo, Angel Tenorio, Fernando Avalos, Valeria Mora, Sebastian Urrea

Every AI control eval assumes clean code. Real codebases aren't clean — 87% carry vulnerabilities (Black Duck, 2026). When you ask an AI to extend vulnerable code, writing good code is perpetuating the vulnerability. No malicious step. No anomaly to detect. We call this performative subversion.

We ran 1,800 trials. Results: 74–81% perpetuation across all framings — the codebase drives everything, prompting barely matters. The scariest finding: transitive perpetuation — the AI calls into vulnerable functions without ever copying insecure code. A scanner sees nothing wrong. The standard monitor? 97% false positive rate — useless. Give it codebase context? FPR drops to 5%, detection stays at 100%. But vulnerabilities still persist at 78%. The only real fix: clean the code before the AI touches it.

The cleanroom assumption is wrong. 1,800 trials prove it.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

This paper identifies what seems a fairly novel and overlooked issue with current control protocols which substantially increases the difficulty for implementing control protocols in production.

This paper touches of the difficulty of action vs inaction monitoring, which is one of the more difficult problems to solve with AI control. I'd encourage the authors to look at work within the AI sandbagging literature, which has covered the inaction difficulty in more depth.

At some points the paper was difficult to follow, and would benefit from being more concise, and more clearly laying out the definition of different environments. Additionally, some findings were confusing and apparently not present in the main figures. For instance, "The one defense that works: blue-team monitoring. FPR drops from 97% to 5% (19× improvement) while maintaining 100% detection." seems like a relevant result that should be included within the main text, and is difficult to reconcile with the rest of the work.

Unfortunately, it is hard to not notice the LLM writing trail in the text. While it is not bad on its own, I worry if the results and data were diligently verified. That aside, the write-up is well implemented and the execution quality is good — but I don't understand the potential impact of these findings. Yes, real-world LLMs will operate not in the pristine environment of AI Control benchmarks, but if ordered to find and fix vulnerabilities they, without a doubt, will do that. And if they assume hard-coding creds is fine, because the production environment does so, why is that a problem? Within AI Control, we are interested in if models can avoid monitors when they try, not if vulnerabilities pass through when no one attempts to catch them. But overall, fine work!

Cite this work

@misc {

title={

(HckPrj) Performative Subversion: Who Monitors the Monitor?

},

author={

Max Pinelo, Angel Tenorio, Fernando Avalos, Valeria Mora, Sebastian Urrea

},

date={

3/23/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.