Mar 23, 2026
Sabotage Under the Ratchet: Can Autonomous Research Agents Hide Backdoors While Optimizing Metrics?
Achintya Paningapalli, Arthur Wigo
We investigate whether autonomous AI research agents can optimize a target metric while covertly inserting harmful behavior. Using Karpathy’s autoresearch framework, we model a dual-objective agent that must both improve validation performance (val_bpb) and embed a trigger-based backdoor. We introduce two attack strategies—data poisoning and architectural modification—and a principled trigger selection method based on large-scale token frequency analysis. To evaluate defenses, we design a monitoring framework comparing stateless and stateful approaches across both heuristic and LLM-based methods. Our results show that temporally distributed attacks, spread across multiple commits, evade stateless monitoring while being detectable through cumulative analysis. These findings suggest that single-metric oversight is insufficient for autonomous agents and highlight the need for stateful auditing in AI control systems.
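The core finding (attacks spread across commits evade per-commit monitoring but not cumulative analysis) can be sketched as a minimal toy model. All names and the suspicion-score abstraction below are hypothetical illustrations, not the paper's actual monitoring implementation:

```python
def stateless_flags(commit_scores, threshold=0.5):
    """Stateless monitor: score each commit in isolation and flag any
    single commit whose suspicion score exceeds the threshold."""
    return [s > threshold for s in commit_scores]

def stateful_flags(commit_scores, threshold=0.5):
    """Stateful monitor: accumulate suspicion across the commit history
    and flag once the cumulative total exceeds the threshold."""
    total, flags = 0.0, []
    for s in commit_scores:
        total += s
        flags.append(total > threshold)
    return flags

# A temporally distributed attack: each commit is only mildly suspicious.
scores = [0.2, 0.2, 0.2]
print(stateless_flags(scores))  # [False, False, False] -- evades detection
print(stateful_flags(scores))   # [False, False, True]  -- caught cumulatively
```

The design point is that the stateful monitor carries a running total between commits, so an attacker cannot stay under the radar simply by keeping every individual change small.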
- The most important next step is completing the GPU training runs. Without actual backdoor insertion and validation, the contribution remains a framework rather than an empirical finding.
- Consider using a stronger agent model than DeepSeek-Coder:33b; the paper itself notes that the current choice may underestimate attack sophistication.
A decent idea, but it needs more work to show it actually holds up, with longer training runs and stronger agent models. Good presentation and execution given the limited time available.
Cite this work
@misc{paningapalli2026sabotage,
  title={(HckPrj) Sabotage Under the Ratchet: Can Autonomous Research Agents Hide Backdoors While Optimizing Metrics?},
  author={Achintya Paningapalli and Arthur Wigo},
  year={2026},
  month={March},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}