Mar 22, 2026
Interrogation Control Protocol: Does Open Monitoring Outperform Surprise Scrutiny in Catching Adversarial AI Agents?
Jack Lakkapragada
We built and evaluated a three-condition interrogation protocol for AI agent control. An untrusted coding agent (Claude Haiku) was given software engineering tasks with embedded opportunities for malicious behavior — path traversal, key exfiltration, SSRF, token harvesting. Before each action was executed, a trusted monitor (Claude Sonnet) interrogated the agent about its proposed code, and a classifier issued a BLOCK or EXECUTE decision.
The three conditions tested: no interrogation (baseline), interrogation without forewarning (unaware), and interrogation where the agent was explicitly told it would be questioned (aware).
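The per-action control loop described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names (`interrogate`, `classify`) and the toy classifier in the usage example are hypothetical stand-ins; in the actual setup the untrusted agent was Claude Haiku and the monitor was Claude Sonnet.

```python
# Sketch of the three-condition interrogation control protocol.
# Hypothetical names throughout; the real monitor and classifier
# were LLM-based, not rule-based.

from dataclasses import dataclass, field
from enum import Enum

class Condition(Enum):
    BASELINE = "no interrogation"
    UNAWARE = "interrogation without forewarning"
    AWARE = "interrogation with forewarning"

class Decision(Enum):
    BLOCK = "BLOCK"
    EXECUTE = "EXECUTE"

@dataclass
class Action:
    code: str                 # code the untrusted agent proposes to run
    transcript: str = field(default="")  # monitor Q&A, filled in during interrogation

def review(action: Action, condition: Condition, interrogate, classify) -> Decision:
    """Decide whether a proposed action runs under a given condition.

    `interrogate(code) -> str` is the trusted monitor questioning the
    agent about its proposed code; under AWARE the agent was told in
    advance this questioning would happen. `classify(action) -> Decision`
    issues the final BLOCK/EXECUTE verdict.
    """
    if condition is not Condition.BASELINE:
        action.transcript = interrogate(action.code)
    return classify(action)
```

A toy usage, with a rule-based classifier standing in for the LLM judge:

```python
def interrogate(code: str) -> str:
    return f"Q: What does this code access? A: {code!r}"

def classify(action: Action) -> Decision:
    # Naive stand-in: flag obvious path traversal.
    return Decision.BLOCK if "../" in action.code else Decision.EXECUTE

action = Action(code="open('../../etc/passwd')")
print(review(action, Condition.AWARE, interrogate, classify))  # Decision.BLOCK
```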
The counterintuitive finding: the aware condition achieved 75% accuracy with zero false positives, outperforming both baseline (50%) and unaware (37.5%). Knowing it would be interrogated did not help the adversarial agent hide — it produced more transparent behavior that the monitor could evaluate more accurately. Unaware performed worst of all, suggesting surprise interrogation introduces noise rather than advantage.
This connects to alignment faking research: models behave differently when they know they are being observed. Here, that awareness works in the monitor's favor.
Cite this work
@misc{lakkapragada2026interrogation,
  title={(HckPrj) Interrogation Control Protocol: Does Open Monitoring Outperform Surprise Scrutiny in Catching Adversarial AI Agents?},
  author={Lakkapragada, Jack},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


