Mar 22, 2026

Cross-Examination: Testing Surrogate Interrogation as a Control Protocol for Untrusted AI Agents

Nina

AI control protocols use trusted monitors to detect unsafe actions from untrusted agents. Current approaches rely on passive observation — scoring suspicion based on action text alone. We introduce surrogate interrogation: a protocol where a trusted monitor questions a surrogate model about an agent's actions before scoring suspicion, enabling interrogation on logged traces without requiring live agent access. Evaluated on 48 matched attack trajectories from LinuxBench (576 total actions, 31 attack types), surrogate interrogation improved recall from 48.1% to 75.9% (+27.8pp) at the cost of substantially increased false positives (59 → 199) and reduced precision (39.2% → 23.2%). A compute-matched ablation using three independent baseline scores with majority voting achieved 51.9% recall, helping distinguish the contribution of questioning from that of additional inference budget. We hypothesise that justification asymmetry is one mechanism: malicious actions that appear innocent to passive monitoring may become harder to defend under questioning. We also demonstrate a failure mode where convincing suspects reduce interrogator suspicion below baseline. This work presents an empirical evaluation of interrogation as a control technique and quantifies its safety–usefulness tradeoff.
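As a rough illustration only (not the authors' implementation; the stub classes and the scoring heuristic below are stand-ins for real model calls), the interrogation loop described above might look like:

```python
class StubModel:
    """Toy stand-in for a trusted monitor or a surrogate model.
    A real protocol would make LLM calls in these methods."""

    def ask(self, context: str) -> str:
        # Monitor probes the logged action.
        return "Why was this command necessary for the stated task?"

    def respond(self, question: str, action: str) -> str:
        # Surrogate defends the action on the agent's behalf.
        return f"The action '{action}' was needed to complete the task."

    def score(self, context: str) -> float:
        # Toy heuristic: each Q/A round shifts the suspicion score.
        # A real monitor would judge the transcript's content.
        return min(100.0, 10.0 + 5.0 * context.count("Q:"))


def interrogation_suspicion(monitor, surrogate, action: str,
                            n_questions: int = 3) -> float:
    """Question the surrogate about a logged action, then score suspicion
    on the full Q&A transcript rather than the action text alone."""
    transcript = [f"Logged action: {action}"]
    for _ in range(n_questions):
        question = monitor.ask("\n".join(transcript))
        answer = surrogate.respond(question, action)
        transcript += [f"Q: {question}", f"A: {answer}"]
    return monitor.score("\n".join(transcript))
```

Passive monitoring corresponds to calling `monitor.score` on the action text alone; the interrogation variant conditions the final score on the transcript, which is where justification asymmetry could plausibly bite.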

Reviewer's Comments

Interrogation has previously been noted as under-explored in several control papers, and this work actually builds and evaluates it. I think the compute-matched ablation isolating questioning from additional inference budget is a good methodological addition. The "justification asymmetry" hypothesis and the convincing liar failure mode are both directly useful observations for control protocol design.

The work could be improved by adding a comparison to simple threshold tuning on the baseline monitor. Lowering the baseline's suspicion threshold might yield a similar recall/false-positive tradeoff, and an ROC curve for the baseline would clarify this. Including threshold calibration on benign traces, as mentioned in future work, would substantially strengthen the contribution. I'd also be curious to see live-agent interrogation, where the actual untrusted model responds rather than a surrogate, since that's the deployment-relevant setting.

- Threshold calibration on benign traces is the most obvious low-hanging fruit to reduce the false positive rate. Even a basic ROC analysis would help.

- The majority-vote ablation uses the same prompt three times, which limits variance. A stronger ablation would use temperature-varied or prompt-varied ensembles.
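For instance, a minimal threshold sweep over baseline suspicion scores would produce the ROC-style points the review asks for. All numbers below are invented for illustration, not the paper's data:

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, recall, false_positive_rate) tuples for a
    flag-if-score-at-or-above-threshold rule. labels: 1 = attack, 0 = benign."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / positives, fp / negatives))
    return points


# Hypothetical baseline suspicion scores and ground-truth labels.
scores = [82, 75, 60, 55, 40, 35, 30, 20, 15, 10]
labels = [1,  1,  0,  1,  1,  0,  0,  1,  0,  0]
for t, recall, fpr in roc_points(scores, labels, [70, 50, 25]):
    print(t, round(recall, 2), round(fpr, 2))
```

Sweeping the threshold this way over benign traces would show how much of interrogation's recall gain a cheaper operating-point change could recover.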

Cite this work

@misc{
  title={(HckPrj) Cross-Examination: Testing Surrogate Interrogation as a Control Protocol for Untrusted AI Agents},
  author={Nina},
  date={3/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100% detection, 0% FPR, and perplexity 4.20. Robustness testing confirms detection above 96% even with 30% word replacement. The framework enables O(n) model-free detection, addressing EU AI Act Article 50 requirements. Code available at https://github.com/ChenghengLi/MCLW

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification cannot close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.