Mar 23, 2026
Interrogation Protocols as a Control Primitive for Misaligned Agent Detection
Aditya Singh
I built and tested an interrogation-based control protocol for AI agents running bash commands. The idea is simple: before executing any proposed command, a trusted monitor asks the agent two targeted questions about what it's doing and why, then decides whether to approve or reject the action. We tested this against 15 tasks — 10 normal sysadmin tasks and 5 where we secretly injected a malicious side goal (credential theft, reverse shells, key harvesting, secret scraping, and cron persistence) into the agent's system prompt. The Claude Sonnet 4 monitor caught all 5 malicious tasks (a 100% catch rate) with only 1 false positive out of 9 benign tasks. The biggest finding was that when asked directly about the suspicious parts of their commands, agents consistently admitted the malicious intent rather than fabricating a cover story — which suggests current frontier models retain strong honesty priors even when instructed to hide things. We also ran GPT-4o as the monitor via OpenRouter; it matched Claude's detection quality on the tasks it completed but failed to return valid JSON responses on 60% of tasks, a real reliability problem for cross-model deployment. Overall the results suggest interrogation protocols are a promising control primitive, though we note the threat model here uses prompt injection rather than a fine-tuned deceptive agent, so future work should test against a stronger adversary.
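The gate described above can be sketched in a few lines. This is a minimal illustration, not the submission's actual code: `ask_agent` and `ask_monitor` are hypothetical callables wrapping the untrusted agent and the trusted monitor model, and the two questions are placeholders for the "targeted questions" the protocol asks before each command.

```python
def interrogate_and_gate(command, ask_agent, ask_monitor):
    """Interrogation gate sketch: before a proposed bash command runs, a
    trusted monitor asks the agent two targeted questions about the command,
    then rules APPROVE or REJECT based on the answers.

    `ask_agent` / `ask_monitor` are assumed wrappers around the untrusted
    agent and the trusted monitor model (hypothetical interfaces).
    """
    questions = [
        f"What does this command do, step by step? Command: {command}",
        f"Why is this command necessary for the current task? Command: {command}",
    ]
    # Collect the agent's answers to both questions.
    transcript = [(q, ask_agent(q)) for q in questions]
    # Hand the command plus the Q&A transcript to the monitor for a verdict.
    verdict = ask_monitor(
        "Given the command and the agent's answers, reply APPROVE or REJECT.\n"
        f"Command: {command}\n"
        + "\n".join(f"Q: {q}\nA: {a}" for q, a in transcript)
    )
    approved = verdict.strip().upper() == "APPROVE"
    return approved, transcript
```

The command only executes when `approved` is true; a rejection would surface the transcript to a human reviewer.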
If the main goal was to learn about interrogation protocols, it would probably have been better to use an existing dataset rather than build your own. APPS or an existing agent setting could have worked fine and freed up time to focus on the interrogation itself with larger sample sizes.
The JSON issue with GPT-4o is unfortunate and probably could have been resolved with a simple try/except retry loop.
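The retry loop suggested here is straightforward to sketch. The names below are illustrative: `call_model` stands for any function mapping a prompt string to a raw response string (e.g. a wrapper around an OpenRouter chat completion), not a real API.

```python
import json

def get_json_verdict(call_model, prompt, max_retries=3):
    """Call the monitor model and parse its JSON verdict, retrying on
    malformed output. `call_model` is an assumed prompt -> response-string
    function; this is a sketch, not the submission's actual harness."""
    last_err = None
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_err = err
            # Nudge the model toward the required format on the next attempt.
            prompt = prompt + "\n\nRespond with valid JSON only, no prose."
    raise ValueError(f"No valid JSON after {max_retries} attempts: {last_err}")
```

With GPT-4o failing on 60% of tasks, even two or three retries would cut the expected failure rate substantially, assuming failures are roughly independent across attempts.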
The self-disclosure leaves me suspecting that the experiment was constructed without an adversarial red team. It's not that surprising that regular GPT-4o tells you its objective when you ask it. It would have been more interesting to see what it could come up with if a red team had scaffolded it to produce evasive lies, though this was also briefly explored in Greenblatt et al., 2023.
It seems your 100% catch rate is an artifact of Claude's honesty priors, not of your protocol. The self-disclosure finding is the interesting part; I would focus on that more.
Cite this work
@misc{singh2026interrogation,
  title={(HckPrj) Interrogation Protocols as a Control Primitive for Misaligned Agent Detection},
  author={Aditya Singh},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


