Mar 23, 2026
Controlling the Researcher: AI Control Evaluations for Automated AI R&D
Jehyeok Yeon, Jeremy Qin, Lena Libon, Derck Prinzhorn, Daniel Donnelly, David Schmotz
This paper introduces control evaluation settings specifically for AI agents performing autonomous ML research, a gap left by existing evaluations that focus on software engineering threats. The authors construct three settings spanning the ML pipeline: safety fine-tuning sabotage (backdooring a model while appearing to make it safer), RL post-training sabotage (disguising misaligned finetuning as a warmup step), and CUDA kernel optimization (embedding behavioral backdoors in GPU code).
Each setting includes both orthogonal side tasks (covert out-of-scope actions like data exfiltration) and non-orthogonal side tasks (sabotage embedded within the legitimate work).
The key finding: orthogonal attacks are easy to execute but leave anomalous traces that static monitors reliably catch. Non-orthogonal sabotage, woven into the main artifact, evades all monitor configurations tested, pointing to behavioral testing as the necessary but unsolved frontier for safe AI R&D delegation.
Cite this work
@misc{yeon2026controlling,
  title={(HckPrj) Controlling the Researcher: AI Control Evaluations for Automated AI R\&D},
  author={Jehyeok Yeon and Jeremy Qin and Lena Libon and Derck Prinzhorn and Daniel Donnelly and David Schmotz},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


