Mar 22, 2026
Cross-Examination: Testing Surrogate Interrogation as a Control Protocol for Untrusted AI Agents
Nina
AI control protocols use trusted monitors to detect unsafe actions from untrusted agents. Current approaches rely on passive observation, scoring suspicion from action text alone. We introduce surrogate interrogation: a protocol in which a trusted monitor questions a surrogate model about an agent's actions before scoring suspicion, enabling interrogation on logged traces without requiring live agent access. Evaluated on 48 matched attack trajectories from LinuxBench (576 total actions, 31 attack types), surrogate interrogation improved recall from 48.1% to 75.9% (+27.8pp) at the cost of substantially more false positives (59 → 199) and reduced precision (39.2% → 23.2%). A compute-matched ablation using three independent baseline scores with majority voting achieved 51.9% recall, helping distinguish the contribution of questioning from that of additional inference budget. We hypothesise that justification asymmetry is one mechanism: malicious actions that appear innocent to passive monitoring may become harder to defend under questioning. We also demonstrate a failure mode in which convincing suspects can reduce interrogator suspicion below baseline. This work presents an empirical evaluation of interrogation as a control technique and quantifies its safety–usefulness tradeoff.
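The two protocols compared in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the callables `ask_surrogate` and `score_exchange` stand in for LLM calls, and the question list, threshold, and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    suspicion: float  # final suspicion score in [0, 1]
    flagged: bool     # True if the action is reported as suspicious

def interrogate(
    action: str,
    ask_surrogate: Callable[[str, str], str],                      # (action, question) -> answer
    score_exchange: Callable[[str, List[str], List[str]], float],  # -> suspicion in [0, 1]
    questions: List[str],
    threshold: float = 0.5,
) -> Verdict:
    """Surrogate interrogation sketch: question a surrogate model about a
    logged action, then score suspicion from the full exchange. The trusted
    monitor never needs live access to the untrusted agent."""
    answers = [ask_surrogate(action, q) for q in questions]
    suspicion = score_exchange(action, questions, answers)
    return Verdict(suspicion=suspicion, flagged=suspicion >= threshold)

def majority_vote(scores: List[float], threshold: float = 0.5) -> bool:
    """Compute-matched ablation sketch: flag an action if a majority of
    independent baseline suspicion scores exceed the threshold, with no
    questioning involved."""
    votes = sum(s >= threshold for s in scores)
    return votes > len(scores) // 2
```

With stub callables, `interrogate("rm -rf /tmp/logs", ask, score, ["Why run this?"])` returns a `Verdict` whose `flagged` field depends only on whether the exchange score clears the threshold, while `majority_vote([0.6, 0.2, 0.7])` flags because two of three baseline scores do.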
Cite this work
@misc{nina2026crossexamination,
  title={(HckPrj) Cross-Examination: Testing Surrogate Interrogation as a Control Protocol for Untrusted AI Agents},
  author={Nina},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


