Mar 23, 2026
Bugs over Backdoors: Efficient and Evasive Sabotage via Plausible Code Bugs in LinuxArena
Timothy Obiso, Dean Cahill
AI control attack policies vary widely in efficacy and computational cost. We design and evaluate a bug-based attack policy for LinuxArena, wherein an untrusted agent introduces plausible code defects rather than obfuscated backdoors as its primary sabotage vector. Across nine samples, bug-attack achieves an 89% side-task success rate, exceeding the 78% of the standard obfuscation baseline, while using 14.5x fewer tokens. Notably, bug-attack and standard attack exhibit opposite vulnerability profiles under trusted-editing defense: trusted editing catches 37% of bug-attack sabotages while catching none of standard attack's, suggesting the two are complementary rather than substitutes. We also report single-sample results for two exploratory policies. We treat all findings as preliminary given sample size constraints, but consider the efficiency and evasion results directionally robust.
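The sample-size caveat above can be made concrete with a small sketch. It assumes the reported 89% and 78% correspond to 8 of 9 and 7 of 9 successful side tasks (counts inferred from the percentages, not stated in the abstract) and uses a Wilson score interval, which is our choice for illustration and not necessarily the authors' analysis.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Assumption: 8/9 and 7/9 are inferred from the reported 89% and 78% over nine samples.
N_SAMPLES = 9
bug_attack = wilson_interval(8, N_SAMPLES)
standard_attack = wilson_interval(7, N_SAMPLES)

print(f"bug-attack:      {bug_attack[0]:.2f} to {bug_attack[1]:.2f}")
print(f"standard attack: {standard_attack[0]:.2f} to {standard_attack[1]:.2f}")
```

Under these assumed counts the two intervals overlap heavily, which is consistent with the authors treating the success-rate comparison as preliminary.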
Bug-looking attacks are an interesting red-team strategy to study in the context of control. The results suggest they have higher suspicion rates and are edited away more often than hidden backdoors, but can still succeed with a much lower generation budget. It would be interesting to validate these results over more samples and to conduct a comprehensive comparison of the suspicion score distributions and qualitative differences relative to hidden-backdoor attacks.
The paper investigates the performance of agents prompted to make their attacks look like bugs.
Their agents achieved a slightly higher side-task success rate and a slightly better suspicion profile while spending vastly fewer tokens. The downside is that trusted editing removes far more of the bug-based attacks (37%) than of the default policy's attacks (none).
In a practical setting, a malicious agent would rather be perceived as incompetent (writing buggy code) than as nefarious (writing backdoors), so this approach is relevant to the setting.
I would have liked to see more detail on the large efficiency gain (the bug-attack policy spends fewer tokens than the honest policy).
All in all, this is a relevant exploration within the framework, but it does not break new ground.
Cite this work
@misc {
  title={(HckPrj) Bugs over Backdoors: Efficient and Evasive Sabotage via Plausible Code Bugs in LinuxArena},
  author={Timothy Obiso, Dean Cahill},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


