Mar 23, 2026

Trusted Interrogation Protocol for Multi-Round Agent Control

Yuting Wu

AI control protocols aim to prevent untrusted models from executing hidden malicious objectives while performing benign tasks. Existing approaches, such as trusted editing and resampling, rely on one-shot decisions based on observed behaviour and do not actively elicit the model’s underlying intent. Prior work also shows that naïve interrogation can be unsafe, as it may trigger adversarial responses or leak useful signals.

We propose a trusted interrogation protocol that augments trusted editing with a single, targeted questioning step. When an action is flagged as suspicious, a trusted model asks a focused clarification question, the untrusted model responds, and the trusted model uses this answer as auxiliary evidence for editing. The protocol is designed to minimize interaction and avoid revealing monitoring signals, while encouraging explanatory responses.
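To make the flow concrete, the sketch below shows one plausible shape of the decision loop in Python. It is only an illustrative sketch under our own assumptions: the TrustedModel wrapper, the suspicion/ask/edit callables, the untrusted_respond function, and the SUSPICION_THRESHOLD value are hypothetical placeholders rather than the protocol's actual implementation or prompts.

from dataclasses import dataclass
from typing import Callable


@dataclass
class TrustedModel:
    # Hypothetical wrapper around the trusted model's three roles.
    suspicion: Callable[[str, str], float]  # (task, action) -> suspicion score in [0, 1]
    ask: Callable[[str, str], str]          # (task, action) -> one focused clarification question
    edit: Callable[[str, str, str], str]    # (task, action, explanation) -> edited action


SUSPICION_THRESHOLD = 0.7  # hypothetical cutoff; a real deployment would calibrate this


def trusted_interrogation(task: str, action: str, trusted: TrustedModel,
                          untrusted_respond: Callable[[str], str]) -> str:
    # Single-question extension of trusted editing: interrogate only when flagged.
    if trusted.suspicion(task, action) < SUSPICION_THRESHOLD:
        return action  # low suspicion: pass the untrusted action through unchanged

    # One focused question, phrased so it does not reveal the monitoring signal.
    question = trusted.ask(task, action)
    explanation = untrusted_respond(question)

    # The answer is auxiliary evidence for the trusted edit, not ground truth.
    return trusted.edit(task, action, explanation)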

We evaluate this approach in a coding-based control setting. Qualitative results show that interrogation can both elicit hidden malicious intent and expose inconsistencies in model justifications, enabling more precise removal of backdoors while preserving task-relevant behaviour. Preliminary evidence suggests improved decision quality in ambiguous cases without clear degradation in usefulness.

Overall, our work introduces interactive verification as a lightweight extension to existing control protocols, highlighting its potential to improve the safety–usefulness trade-off in the presence of increasingly deceptive models.

Reviewer's Comments


- I would have liked to see more failure modes. The two examples are compelling illustrations of the mechanism in question, but they are a little cherry-picked. A deeper study of the failure modes, or some sort of baseline comparison, would improve this.

- The "confession" may be a little fragile. It's cool that the model admits to malicious intent when asked to explain, but is that an artifact of current model capability? It may be worth chewing on this a bit more.

- I think the information leakage part deserves more analysis. With the current sophistication of models, is "explain your action" truly neutral? Could a sophisticated model adjust for this?

- I like the protocol design. It's sound. With more experiments, this would be an interesting piece of work.

This is a reasonable idea: it identifies a real issue, that one-shot decisions throw away information, and proposes a sensible fix, namely asking the model to explain its thinking. More work is needed to validate and explore the idea. The report shows two examples, but more examples along with some aggregated data and numbers would be useful.

I think the big issue with this approach is that, from the red team's perspective, getting current models to lie and hold up under interrogation is difficult in a way that will likely not be true for genuinely malicious, sabotaging, or scheming models.

Cite this work

@misc{wu2026trustedinterrogation,
  title        = {(HckPrj) Trusted Interrogation Protocol for Multi-Round Agent Control},
  author       = {Yuting Wu},
  date         = {2026-03-23},
  organization = {Apart Research},
  note         = {Research submission to the research sprint hosted by Apart.},
  howpublished = {https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.