Mar 23, 2026
Trusted Interrogation Protocol for Multi-Round Agent Control
Yuting Wu
AI control protocols aim to prevent untrusted models from executing hidden malicious objectives while performing benign tasks. Existing approaches, such as trusted editing and resampling, rely on one-shot decisions based on observed behaviour and do not actively elicit the model’s underlying intent. Prior work also shows that naïve interrogation can be unsafe, as it may trigger adversarial responses or leak useful signals.
We propose a trusted interrogation protocol that augments trusted editing with a single, targeted questioning step. When an action is flagged as suspicious, a trusted model asks a focused clarification question, the untrusted model responds, and the trusted model uses this answer as auxiliary evidence for editing. The protocol is designed to minimize interaction and avoid revealing monitoring signals, while encouraging explanatory responses.
We evaluate this approach in a coding-based control setting. Qualitative results show that interrogation can both elicit hidden malicious intent and expose inconsistencies in model justifications, enabling more precise removal of backdoors while preserving task-relevant behaviour. Preliminary evidence suggests improved decision quality in ambiguous cases without clear degradation in usefulness.
Overall, our work introduces interactive verification as a lightweight extension to existing control protocols, highlighting its potential to improve the safety–usefulness trade-off in the presence of increasingly deceptive models.
- I would have liked to see more failure modes. The two examples are great and compelling illustrations of the mechanism in question. But they're a little cherry picked. To improve this, I think a deeper study into the failure modes would be great. Or some sort of a baseline comparison
- The "confession" maybe a little fragile. I think it's cool that the model admits to malicious intent when asked to explain. But is it an artifact of current model capability? It maybe worth chewing on this a bit more
- I think the information leakage part deserves more analysis. With the current sophistication of models, is "explain your action" truly neutral? Could a sophisticated model adjust for this?
- I like the protocol design. It's sound. More experiments and this would be an interesting piece of work.
This is a reasonable idea that identifies a real issue where one-shotting throws away information and proposes a sensible fix, which is like asking the model about thinking. I think there is more work needed to validate the idea and do more explorations of the idea. There are two examples that are shown in the report, but I think that more examples and some more aggregated data and numbers would be useful.
I think the big issue with this approach is that, from the red team's perspective, getting the models to lie and be strong against interrogation is difficult in a way that it will not likely be true for actually malicious and sabotaging/scheming models.
Cite this work
@misc {
title={
(HckPrj) Trusted Interrogation Protocol for Multi-Round Agent Control
},
author={
Yuting Wu
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


