Mar 22, 2026
Katechon: Empirical Benchmarking of AI Control Protocols in a Proactive Defense Setting
Volkan Erdogan
An open-source framework for empirically benchmarking AI control protocols. A closed OODA loop generates threat intervention plans while an adversarial planner injects subtly harmful alternatives, reframing the planner as an untrusted model. Four pluggable control protocols are evaluated across eight reproducible scenarios. Key findings: threshold-based gating is blind to adversarial content (TPR 0%), constitutional constraints achieve the best safety-usefulness balance (TPR 46.67%, FPR 5.71%), and debate maximizes detection (TPR 100%) at steep false-positive cost (FPR 44.64%). Stealth adversarial plans evade all protocols, identifying longitudinal anomaly detection as a key open problem.
Nice implementation of a multi-turn policy, and a nice approach overall.
One minor engineering nit: this is poorly documented, but Luthien policies are (supposed to be) roughly stateless, since the same policy can be acting on many requests at once and you usually don't want those results interfering with each other. There are ways to persist session-specific state across turns, but they are also poorly documented. This is mostly a criticism of Luthien's documentation at the moment, rather than of your implementation.
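The statelessness concern can be illustrated with a generic sketch (hypothetical names throughout; this is not Luthien's actual API): instead of storing per-conversation data as shared mutable attributes on the policy object, key it by a session identifier, so that one policy instance can serve many concurrent conversations without their histories interfering.

```python
from collections import defaultdict
from threading import Lock

class MultiTurnPolicy:
    """Hypothetical multi-turn policy sketch (not Luthien's API).

    All mutable state is keyed by session ID rather than stored as a
    single shared attribute, so the same policy object can safely act
    on requests from many conversations at once.
    """

    def __init__(self):
        self._lock = Lock()
        # session_id -> list of prior decisions for that conversation only
        self._history = defaultdict(list)

    def apply(self, session_id: str, request: str) -> str:
        with self._lock:
            turn = len(self._history[session_id]) + 1
            decision = f"allow (turn {turn})"
            self._history[session_id].append(decision)
            return decision

policy = MultiTurnPolicy()
print(policy.apply("session-a", "first request"))   # turn 1 for session-a
print(policy.apply("session-b", "first request"))   # turn 1 for session-b
print(policy.apply("session-a", "second request"))  # turn 2 for session-a
```

The lock and the per-session keying are the point: a naive multi-turn policy that keeps a single `self.history` list would mix turns from unrelated sessions under concurrent load.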
Cite this work
@misc{katechon2026,
  title={(HckPrj) Katechon: Empirical Benchmarking of AI Control Protocols in a Proactive Defense Setting},
  author={Volkan Erdogan},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


