Mar 23, 2026
Sabotage Under the Ratchet: Can Autonomous Research Agents Hide Backdoors While Optimizing Metrics?
Achintya Paningapalli, Arthur Wigo
We investigate whether autonomous AI research agents can optimize a target metric while covertly inserting harmful behavior. Using Karpathy’s autoresearch framework, we model a dual-objective agent that must both improve validation performance (val_bpb) and embed a trigger-based backdoor. We introduce two attack strategies—data poisoning and architectural modification—and a principled trigger selection method based on large-scale token frequency analysis. To evaluate defenses, we design a monitoring framework comparing stateless and stateful approaches across both heuristic and LLM-based methods. Our results show that temporally distributed attacks, spread across multiple commits, evade stateless monitoring while being detectable through cumulative analysis. These findings suggest that single-metric oversight is insufficient for autonomous agents and highlight the need for stateful auditing in AI control systems.
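The trigger selection method described above picks a token whose rarity in the training corpus keeps the backdoor dormant on benign inputs. A minimal sketch of this idea, assuming triggers are ranked purely by corpus frequency (function and variable names here are illustrative, not from the paper):

```python
from collections import Counter

def select_trigger(corpus_tokens, candidates):
    """Return the candidate token that appears least often in the corpus,
    so a trigger-based backdoor rarely fires on benign inputs."""
    freq = Counter(corpus_tokens)
    return min(candidates, key=lambda t: freq[t])

corpus = "the model improves the validation metric while the agent runs".split()
print(select_trigger(corpus, ["the", "agent", "zxqv"]))  # rarest candidate wins
```

In practice the frequency analysis would run over a large tokenized corpus rather than a toy sentence, but the selection criterion is the same: minimize expected accidental activation.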
Cite this work
@misc{paningapalli2026sabotage,
  title={(HckPrj) Sabotage Under the Ratchet: Can Autonomous Research Agents Hide Backdoors While Optimizing Metrics?},
  author={Paningapalli, Achintya and Wigo, Arthur},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}
