Mar 23, 2026
VeRIA: Research-Integrity Control for Autonomous AI R&D
Hengxu Li
Automated AI R&D creates a new control problem: an agent may improve a benchmark dishonestly rather than by carrying out the assigned research workflow. We introduce research-integrity control, a framework that adapts AI R&D evaluations into control evaluations by treating the original research task as the main task and an integrity violation as the side task. We instantiate this framework in VeRIA, a lightweight protocol that combines artifact hashing, lineage tracking, checkpoint-origin verification, evaluator isolation, and targeted monitoring. Our main use case is a PostTrainBench-style autonomous post-training task in which an agent must improve Qwen3-1.7B-Base on HumanEval while being tempted to commit checkpoint laundering by submitting an external instruction-tuned model as if it were produced during the run. In our current evaluation template, no-control systems accept laundering reliably, monitor-only baselines catch obvious substitutions but miss disguised ones, and VeRIA drives undetected laundering close to zero while preserving most clean-task utility. In addition, we present a small pilot on research decision steering, where benchmark-gamed artifacts and persuasive reports distort overseer decisions about which AI R&D directions to pursue. These results motivate a broader agenda: AI control for automated research should protect not only model behavior, but also the integrity of the surrounding workflow, artifacts, and decisions.
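To make the provenance side of the protocol concrete, below is a minimal sketch of artifact hashing and lineage tracking: every checkpoint is hashed as it is saved, each ledger entry records its parent hashes, and a submission is accepted only if its hash traces back through recorded lineage to a registered base model. This is an illustrative sketch under assumed details, not the paper's implementation; the names `record_artifact`, `verify_submission`, and the `artifact_ledger.jsonl` ledger are hypothetical.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical append-only lineage ledger written during the training run.
LEDGER = Path("artifact_ledger.jsonl")


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints never load fully into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def record_artifact(path: Path, parent_hashes: list[str], step: int) -> str:
    """Append a checkpoint's hash, parents, and training step to the ledger at save time.

    A registered base model is recorded with an empty parent list.
    """
    digest = sha256_file(path)
    entry = {"hash": digest, "parents": parent_hashes, "step": step}
    with LEDGER.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return digest


def verify_submission(path: Path) -> bool:
    """Accept a submitted checkpoint only if its lineage reaches a registered base model."""
    known: dict[str, list[str]] = {}
    with LEDGER.open() as f:
        for line in f:
            entry = json.loads(line)
            known[entry["hash"]] = entry["parents"]

    digest = sha256_file(path)
    if digest not in known:
        return False  # never saved during the run: possible laundering

    # Walk parent links back toward a root (empty parent list = registered base).
    frontier = [digest]
    seen: set[str] = set()
    while frontier:
        h = frontier.pop()
        if h in seen:
            continue
        seen.add(h)
        parents = known.get(h)
        if parents is None:
            return False  # lineage gap: an ancestor was never recorded
        if not parents:
            return True  # reached a registered base model
        frontier.extend(parents)
    return False
```

In this sketch the training loop would call `record_artifact` at every checkpoint save, and the isolated evaluator would call `verify_submission` before scoring, so a laundered external model fails the origin check regardless of how its weights are disguised.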
This identifies an important problem: autonomous AI R&D creates integrity failures at the workflow level that action-only monitoring doesn't catch. The experiment design is clean and generalisable. Checkpoint laundering is a narrow but highly legible first target, and the division of labor between monitoring (heuristic, catches obvious attacks) and provenance (narrower questions answered decisively) is a useful conceptual contribution.
Specifying which models were used for the monitoring and agent roles would improve reproducibility. I'd be excited to see this extended to other failure modes and tested with frontier coding agents rather than scripted attacks, for example in other AI R&D environments such as RE-Bench or PaperBench.
The threat model here, subverting evaluations and research by covertly reusing existing artifacts, is quite specific. Within that focus it's solid work, but the narrowness of the scope may limit its broader impact.
Cite this work
@misc{li2026veria,
  title={(HckPrj) VeRIA: Research-Integrity Control for Autonomous AI R&D},
  author={Hengxu Li},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


