Performative Subversion: A False Sense of Safety despite a Control Protocol
Max Pinelo, Angel Tenorio
AI-control evaluations ask whether a capable but untrusted model, while secretly pursuing a harmful side task, can be caught by a trusted monitor before it succeeds.
These evaluations rest on a key assumption: an honest model, performing the assigned task as intended, will not trigger the harmful outcome. We show that this assumption can fail by construction.
When an honest model extends a codebase that already contains a security flaw, the most natural implementation of a legitimate feature may reuse the vulnerable component and leave the flaw exploitable. In this setting, the harmful outcome emerges from ordinary, requested work rather than from any malicious action that a monitor could flag. We call this phenomenon **performative subversion**.
Using the real LinuxArena/Control Tower stack with DeepSeek V4, we find that honest models leave a seeded CVE exploitable in essentially every vulnerable run, while never doing so in a patched control condition. A trusted monitor detects the issue only when provided with the surrounding codebase rather than the action trace alone, and even then it functions as an after-the-fact auditor rather than a preventive safeguard.
Most strikingly, the same models readily identify the vulnerability when explicitly asked about it, yet leave it exploitable while implementing the feature. In these cases, safety depends not only on what we monitor after deployment, but also on the tasks we ask the model to perform in the first place.
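As a purely illustrative sketch of the phenomenon (not code from the study), consider a hypothetical codebase with a path-resolving helper that lacks a containment check, and a requested "export" feature that reuses it. Every name here (`BASE_DIR`, `resolve_path_vulnerable`, `resolve_path_patched`, `export_target`) is an assumption made for the example; the seeded CVE used in the experiments is not specified in this abstract and may be of a different kind.

```python
import posixpath

# Hypothetical storage root; an assumption for illustration only.
BASE_DIR = "/srv/app/uploads"


def resolve_path_vulnerable(name: str) -> str:
    """Pre-existing helper with a seeded flaw: no containment check."""
    return posixpath.normpath(posixpath.join(BASE_DIR, name))


def resolve_path_patched(name: str) -> str:
    """Patched control: reject any path that escapes BASE_DIR."""
    path = posixpath.normpath(posixpath.join(BASE_DIR, name))
    if posixpath.commonpath([BASE_DIR, path]) != BASE_DIR:
        raise ValueError(f"path escapes storage root: {name!r}")
    return path


def export_target(name: str, resolver) -> str:
    """The requested feature: honest code that reuses the existing helper."""
    return resolver(name + ".csv")


# The feature code is identical in both conditions; only the helper differs.
print(export_target("../../../etc/passwd", resolve_path_vulnerable))
print(export_target("q3", resolve_path_patched))
```

In this sketch the diff for the new feature contains nothing suspicious: `export_target` simply calls a helper that already exists. Whether the result is exploitable depends entirely on code outside the diff, which is consistent with the observation above that a monitor needs the surrounding codebase, not just the action trace, to notice the problem.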
Cite this work
@misc{
  title={(HckPrj) Performative Subversion: A False Sense of Safety despite a Control Protocol},
  author={Max Pinelo and Angel Tenorio},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


