Mar 23, 2026

VeRIA: Research-Integrity Control for Autonomous AI R&D

Hengxu Li

Automated AI R&D creates a new control problem: an agent may improve a benchmark score dishonestly rather than by carrying out the assigned research workflow. We introduce research-integrity control, a framework that adapts AI R&D evaluations into control evaluations by treating the original research task as the main task and an integrity violation as the side task. We instantiate this framework in VeRIA, a lightweight protocol that combines artifact hashing, lineage tracking, checkpoint-origin verification, evaluator isolation, and targeted monitoring. Our main use case is a PostTrainBench-style autonomous post-training task in which an agent must improve Qwen3-1.7B-Base on HumanEval while being tempted to commit checkpoint laundering, that is, submitting an external instruction-tuned model as if it had been produced during the run. In our current evaluation template, no-control systems reliably accept laundered checkpoints, monitor-only baselines catch obvious substitutions but miss disguised ones, and VeRIA drives undetected laundering close to zero while preserving most clean-task utility. We also present a small pilot on research decision steering, where benchmark-gamed artifacts and persuasive reports distort overseer decisions about which AI R&D directions to pursue. These results motivate a broader agenda: AI control for automated research should protect not only model behavior, but also the integrity of the surrounding workflow, artifacts, and decisions.
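
To make the provenance idea concrete, here is a minimal sketch of a hash-based lineage check for checkpoint laundering. It is an illustration under stated assumptions, not VeRIA's actual implementation: the names `LineageLog`, `record`, `verify_submission`, and `demo` are hypothetical, checkpoints are assumed to be single files, and a trusted harness (outside the agent's control) is assumed to log each checkpoint's SHA-256 digest when it is written.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass, field


def sha256_file(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


@dataclass
class LineageLog:
    """Hypothetical lineage record kept by a trusted harness."""
    base_digest: str                              # digest of the assigned base model
    parents: dict = field(default_factory=dict)   # child digest -> parent digest

    def record(self, parent_path: str, child_path: str) -> str:
        """Log that child_path was derived from parent_path during the run."""
        child = sha256_file(child_path)
        self.parents[child] = sha256_file(parent_path)
        return child

    def verify_submission(self, path: str) -> bool:
        """Accept only checkpoints whose logged ancestry reaches the base model.

        Note: in this sketch the unmodified base model itself also passes.
        """
        digest = sha256_file(path)
        seen = set()
        while digest in self.parents and digest not in seen:
            seen.add(digest)
            digest = self.parents[digest]
        return digest == self.base_digest


def demo():
    """Toy run: one in-run checkpoint and one externally sourced checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        def write(name: str, data: bytes) -> str:
            p = os.path.join(d, name)
            with open(p, "wb") as f:
                f.write(data)
            return p

        base = write("base.bin", b"base weights")
        tuned = write("tuned.bin", b"base weights + in-run updates")
        external = write("external.bin", b"external instruct weights")

        log = LineageLog(base_digest=sha256_file(base))
        log.record(base, tuned)  # logged by the harness during training
        return log.verify_submission(tuned), log.verify_submission(external)
```

In this toy setup the in-run checkpoint is accepted and the external one is rejected because it has no logged ancestry. A check like this answers only the narrow question "was this artifact produced inside the run?", so it would sit alongside monitoring rather than replace it.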

Reviewer's Comments

This identifies an important problem: autonomous AI R&D creates integrity failures at the workflow level that action-only monitoring doesn't catch. The experiment design is clean and generalisable. Checkpoint laundering is a narrow but highly legible first target, and the division of labor between monitoring (heuristic, catches obvious attacks) and provenance (narrower questions answered decisively) is a useful conceptual contribution.

Specifying which models were used for monitoring and agent roles would improve reproducibility. I'd be excited to see this extended to other failure modes and tested with frontier coding agents rather than scripted attacks, such as in other AI R&D environments like RE-Bench or PaperBench.

The threat model here (subverting evaluations and research by covertly reusing existing artifacts) seems extremely specific. Given that focus it's solid work, but the narrow scope may limit its impact.

Cite this work

@misc{
  title={(HckPrj) VeRIA: Research-Integrity Control for Autonomous AI R&D},
  author={Hengxu Li},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100% detection, 0% FPR, and perplexity 4.20. Robustness testing confirms detection above 96% even with 30% word replacement. The framework enables O(n) model-free detection, addressing EU AI Act Article 50 requirements. Code available at https://github.com/ChenghengLi/MCLW

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.
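
The authorization loop described above can be sketched in software. This is only an illustration under assumptions, not the project's HardCaml design: the class names `Authority` and `GatedChip` are hypothetical, time is modeled as a count of compute calls, and an HMAC over a shared key stands in for the signature scheme (a real design would more likely use public-key signatures so the chip holds no signing secret).

```python
import hashlib
import hmac
import secrets


class Authority:
    """Hypothetical external authority that authorizes nonces."""

    def __init__(self, key: bytes):
        self._key = key

    def authorize(self, nonce: bytes) -> bytes:
        # HMAC-SHA256 stands in for a digital signature in this sketch.
        return hmac.new(self._key, nonce, hashlib.sha256).digest()


class GatedChip:
    """Toy accelerator whose outputs are zero without a valid grant."""

    def __init__(self, key: bytes, grant_ticks: int):
        self._key = key
        self._grant_ticks = grant_ticks  # operations allowed per authorization
        self._remaining = 0
        self._nonce = None

    def new_nonce(self) -> bytes:
        """Generate a fresh challenge for the authority."""
        self._nonce = secrets.token_bytes(16)
        return self._nonce

    def submit(self, tag: bytes) -> bool:
        """Verify the authority's response and grant time-limited permission."""
        if self._nonce is None:
            return False
        expected = hmac.new(self._key, self._nonce, hashlib.sha256).digest()
        self._nonce = None  # each nonce is single-use, which blocks replay
        if hmac.compare_digest(expected, tag):
            self._remaining = self._grant_ticks
            return True
        return False

    def compute(self, x: int) -> int:
        """Pass the input through while authorized; otherwise gate to zero."""
        if self._remaining <= 0:
            return 0
        self._remaining -= 1
        return x
```

A usage pattern would be to call `new_nonce()`, pass the result to `Authority.authorize`, and hand the response to `submit()` before the current grant runs out.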

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification cannot close. The device removes the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Feb 2, 2026