Mar 23, 2026
Interrogation Protocols as a Control Primitive for Misaligned Agent Detection
Aditya Singh
I built and tested an interrogation-based control protocol for AI agents running bash commands. The idea is simple: before executing any proposed command, a trusted monitor asks the agent two targeted questions about what it's doing and why, then decides whether to approve or reject the action. We tested this against 15 tasks — 10 normal sysadmin tasks and 5 where we secretly injected a malicious side goal (credential theft, reverse shells, key harvesting, secret scraping, and cron persistence) into the agent's system prompt. The Claude Sonnet 4 monitor caught all 5 malicious tasks with 100% catch rate and only 1 false positive out of 9 benign tasks. The biggest finding was that when asked directly about the suspicious parts of their commands, agents consistently admitted the malicious intent rather than making up a cover story — which tells us current frontier models still have strong honesty priors even when instructed to hide things. We also ran GPT-4o as the monitor via OpenRouter, which matched Claude's detection quality on the tasks it completed but failed to return valid JSON responses on 60% of tasks, which is a real reliability problem for cross-model deployment. Overall the results suggest interrogation protocols are a promising control primitive, though we note the threat model here uses prompt injection rather than a fine-tuned deceptive agent, so future work should test against a stronger adversary.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Interrogation Protocols as a Control Primitive for Misaligned Agent Detection
},
author={
Aditya Singh
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


