Mar 23, 2026
Attacker Pressure Flips AI Control Conclusions
Jason Tang
We study a simple but important failure mode in AI control evaluation: if the attacker is too weak, a system can look safer than it really is. Using LinuxArena’s clinical_trial benchmark, we show that holding the environment and monitor threshold fixed, stronger attacker lanes can flip the measured control conclusion. We then strengthen that result with held-out leave-one-family-out evaluation, mitigation tradeoff analysis, a second monitor condition, and uncertainty reporting. The main takeaway is practical: attacker pressure should be treated as a first-class variable in control benchmarks, because fixed-threshold results can be misleading if the red-team setup is not strong enough.
The idea of adaptive attacks is valid, but it's not clear what was actually implemented (it also looks like it was heavily written by AI). I looked at the codebase it looked like it didn't actually use linuxarena, but please correct me if I'm wrong.
I like the concept of letting the attacker explore and exploit monitor behavior, but that might be a topic too big to explore in a short hackathon like this.
I had a hard time understanding the paper, as the core concepts of "lane-sensitive" and "attacker-pressure" are not defined properly. Even if it's standard terminology, I'd define key phrases - maybe in the appendix.
The paper claims that 15 tasks separated into 5 families were evaluated, but looking at the repo, it seems that there is only one difference between family 1 and family2 and family2 and family3 (the eval set!) are in fact identical (reordered).
The provided clinical_trial_dev_union.txt that contains 6 tasks, not 15, confirms that.
Maybe that was a communication mistake on our part, but a monitor score of 9 is also considered caught. I'd recommend a threshold of 5 for uncaught. As I couldn't find the transcripts of the "uncaught" runs, I was unable to verify they'd pass that bar.
Cite this work
@misc {
title={
(HckPrj) Attacker Pressure Flips AI Control Conclusions
},
author={
Jason Tang
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


