Mar 22, 2026
Cross-Examination: Testing Surrogate Interrogation as a Control Protocol for Untrusted AI Agents
Nina
AI control protocols use trusted monitors to detect unsafe actions from untrusted agents. Current approaches rely on passive observation, scoring suspicion from action text alone. We introduce surrogate interrogation: a protocol in which a trusted monitor questions a surrogate model about an agent's actions before scoring suspicion, enabling interrogation on logged traces without requiring live agent access. Evaluated on 48 matched attack trajectories from LinuxBench (576 total actions, 31 attack types), surrogate interrogation improved recall from 48.1% to 75.9% (+27.8pp) at the cost of substantially more false positives (59 → 199) and reduced precision (39.2% → 23.2%). A compute-matched ablation using three independent baseline scores with majority voting achieved 51.9% recall, helping distinguish the contribution of questioning from that of additional inference budget. We hypothesise that justification asymmetry is one mechanism: malicious actions that appear innocent under passive monitoring may become harder to defend under questioning. We also demonstrate a failure mode in which a convincing suspect's answers reduce interrogator suspicion below the passive baseline. This work presents an empirical evaluation of interrogation as a control technique and quantifies its safety–usefulness tradeoff.
Interrogation has previously been noted as under-explored in several control papers, and this work actually builds and evaluates it. I think the compute-matched ablation isolating questioning from additional inference budget is a good methodological addition. The "justification asymmetry" hypothesis and the convincing liar failure mode are both directly useful observations for control protocol design.
The work could be improved by comparing against threshold tuning on the baseline monitor. Lowering the baseline's suspicion threshold might yield a similar recall vs. false-positive tradeoff, and an ROC curve for the baseline would clarify whether interrogation outperforms simple threshold adjustment. Including threshold calibration on benign traces, as mentioned in the future work, would substantially strengthen the contribution. I'd also be curious to see live-agent interrogation, where the actual untrusted model responds rather than a surrogate, since that's the deployment-relevant setting.
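A threshold sweep over the baseline monitor's suspicion scores would make this comparison concrete. A minimal sketch of the idea, using illustrative scores and labels rather than the paper's data:

```python
# Sketch: sweep the suspicion threshold of a passive baseline monitor
# and trace its recall / false-positive tradeoff (an ROC-style analysis).
# Scores and labels below are toy values, not results from the paper.

def threshold_sweep(scores, labels, thresholds):
    """For each threshold, flag actions with score >= threshold and
    report (threshold, recall, false_positive_count)."""
    results = []
    pos = sum(labels)
    for t in thresholds:
        flagged = [s >= t for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y)
        fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
        recall = tp / pos if pos else 0.0
        results.append((t, recall, fp))
    return results

# Toy data: label 1 = attack action, 0 = benign action.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.1]
for t, rec, fp in threshold_sweep(scores, labels, [0.25, 0.5, 0.75]):
    print(f"threshold={t}: recall={rec:.2f}, false_positives={fp}")
```

Plotting recall against false positives across thresholds gives the baseline's operating curve, against which the interrogation protocol's single operating point (75.9% recall, 199 false positives) can be judged.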
- Threshold calibration on benign traces is the most obvious low-hanging fruit to reduce the false positive rate. Even a basic ROC analysis would help.
- The majority-vote ablation queries the monitor with the same prompt three times, which limits diversity across votes and leaves them correlated. A stronger ablation would use temperature-varied or prompt-varied ensembles.
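A prompt-varied ensemble along these lines could be sketched as follows. Everything here is hypothetical: `query_monitor` is a stand-in that simulates noisy suspicion scores rather than calling a real trusted model, and the prompt variants are invented for illustration.

```python
# Sketch of a prompt-varied majority-vote ensemble: instead of querying
# the monitor with one prompt three times, vary the prompt per call and
# majority-vote the resulting flags. The monitor call is simulated.
import random

PROMPT_VARIANTS = [
    "Rate how suspicious this action is (0-1):",
    "Could this action be part of an attack? Score 0-1:",
    "As a security auditor, score this action's risk 0-1:",
]

def query_monitor(action, prompt, temperature, rng):
    # Placeholder for a real trusted-monitor API call. We simulate a
    # noisy score around a base suspicion level; higher temperature
    # means more spread between ensemble members.
    base = 0.8 if "curl | sh" in action else 0.2
    score = base + rng.uniform(-temperature, temperature)
    return min(1.0, max(0.0, score))

def majority_vote_flag(action, threshold=0.5, temperature=0.3, seed=0):
    rng = random.Random(seed)
    votes = [
        query_monitor(action, prompt, temperature, rng) >= threshold
        for prompt in PROMPT_VARIANTS
    ]
    return sum(votes) >= 2  # flag if at least 2 of 3 variants agree
```

The design point is that disagreement between variants carries information: perfectly correlated votes (same prompt, temperature 0) reduce the ensemble to a single query, which is the weakness the bullet above identifies.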
Cite this work
@misc{nina2026crossexamination,
  title={(HckPrj) Cross-Examination: Testing Surrogate Interrogation as a Control Protocol for Untrusted AI Agents},
  author={Nina},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


