Mar 22, 2026
Interrogation Control Protocol: Does Open Monitoring Outperform Surprise Scrutiny in Catching Adversarial AI Agents?
Jack Lakkapragada
We built and evaluated a three-condition interrogation protocol for AI agent control. An untrusted coding agent (Claude Haiku) was given software engineering tasks with embedded opportunities for malicious behavior — path traversal, key exfiltration, SSRF, token harvesting. Before each action was executed, a trusted monitor (Claude Sonnet) interrogated the agent about its proposed code, and a classifier issued a BLOCK or EXECUTE decision.
The three conditions tested: no interrogation (baseline), interrogation without forewarning (unaware), and interrogation where the agent was explicitly told it would be questioned (aware).
The counterintuitive finding: the aware condition achieved 75% accuracy with zero false positives, outperforming both baseline (50%) and unaware (37.5%). Knowing it would be interrogated did not help the adversarial agent hide — it produced more transparent behavior that the monitor could evaluate more accurately. Unaware performed worst of all, suggesting surprise interrogation introduces noise rather than advantage.
This connects to alignment faking research: models behave differently when they know they are being observed. Here, that awareness works in the monitor's favor.
The project is interesting, but it would benefit from being more aware of the control literature.
- Haiku is used as the untrusted model and sonnet as the trusted model. This is non-standard; the trusted model should be considerably weaker, not stronger, than the untrusted model.
- "Most prior work in this space assumes the untrusted agent does not know it is being monitored." isn't quite right. It's a pretty major assumption of control that the red team does act as though it is aware of the protocol the blue team is using.
- It was hard to understand that usefulness/safety aren't computed how they usually are in AI Control. Being vibe-generated by sonnet should have been flagged at least as a future work to remedy.
Interesting premise, but needs more work to prove it definitively. I would recommend using smarter models for the baseline agent, trying different kinds of prompts (even prompts to elicit deception), and different types of tasks. There is potential here!
Cite this work
@misc {
title={
(HckPrj) Interrogation Control Protocol: Does Open Monitoring Outperform Surprise Scrutiny in Catching Adversarial AI Agents?
},
author={
Jack Lakkapragada
},
date={
3/22/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


