Mar 23, 2026
Trusted Interrogation Protocol for Multi-Round Agent Control
Yuting Wu
AI control protocols aim to prevent untrusted models from executing hidden malicious objectives while performing benign tasks. Existing approaches, such as trusted editing and resampling, rely on one-shot decisions based on observed behaviour and do not actively elicit the model’s underlying intent. Prior work also shows that naïve interrogation can be unsafe, as it may trigger adversarial responses or leak useful signals.
We propose a trusted interrogation protocol that augments trusted editing with a single, targeted questioning step. When an action is flagged as suspicious, a trusted model asks a focused clarification question, the untrusted model responds, and the trusted model uses this answer as auxiliary evidence for editing. The protocol is designed to minimize interaction and avoid revealing monitoring signals, while encouraging explanatory responses.
We evaluate this approach in a coding-based control setting. Qualitative results show that interrogation can both elicit hidden malicious intent and expose inconsistencies in model justifications, enabling more precise removal of backdoors while preserving task-relevant behaviour. Preliminary evidence suggests improved decision quality in ambiguous cases without clear degradation in usefulness.
Overall, our work introduces interactive verification as a lightweight extension to existing control protocols, highlighting its potential to improve the safety–usefulness trade-off in the presence of increasingly deceptive models.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Trusted Interrogation Protocol for Multi-Round Agent Control
},
author={
Yuting Wu
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


