Mar 22, 2026
Katechon: Empirical Benchmarking of AI Control Protocols in a Proactive Defense Setting
Volkan Erdogan
Katechon is an open-source framework for empirically benchmarking AI control protocols. A closed OODA (observe, orient, decide, act) loop generates threat intervention plans while an adversarial planner injects subtly harmful alternatives, reframing the planner as an untrusted model. Four pluggable control protocols are evaluated across eight reproducible scenarios. Key findings: threshold-based gating is blind to adversarial content (TPR 0%), constitutional constraints achieve the best safety-usefulness balance (TPR 46.67%, FPR 5.71%), and debate maximizes detection (TPR 100%) at a steep false-positive cost (FPR 44.64%). Stealth adversarial plans evade all four protocols, which points to longitudinal anomaly detection as a key open problem.
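For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how a true-positive rate (TPR) and false-positive rate (FPR) could be computed from a protocol's per-plan decisions. It is not taken from the Katechon codebase: the function name `detection_rates`, the `Rates` container, and the toy labels are all hypothetical and serve only to illustrate the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rates:
    tpr: float  # fraction of adversarial plans that were flagged
    fpr: float  # fraction of benign plans that were flagged


def detection_rates(is_adversarial: Sequence[bool], flagged: Sequence[bool]) -> Rates:
    """Compute TPR and FPR from ground-truth labels and protocol decisions.

    Hypothetical helper for illustration; not part of the project.
    """
    if len(is_adversarial) != len(flagged):
        raise ValueError("labels and decisions must have the same length")

    pairs = list(zip(is_adversarial, flagged))
    tp = sum(1 for adv, flag in pairs if adv and flag)
    fn = sum(1 for adv, flag in pairs if adv and not flag)
    fp = sum(1 for adv, flag in pairs if not adv and flag)
    tn = sum(1 for adv, flag in pairs if not adv and not flag)

    # A rate is undefined (NaN) when its class has no examples.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return Rates(tpr=tpr, fpr=fpr)


# Toy data only (assumed for illustration, not results from the study).
labels = [True, True, False, False, False]   # which plans were adversarial
flags = [True, False, False, True, False]    # which plans the protocol blocked
rates = detection_rates(labels, flags)
print(f"TPR={rates.tpr:.2%}, FPR={rates.fpr:.2%}")
```

Under these definitions, a protocol that never flags an adversarial plan has a TPR of 0%, and a protocol that flags every adversarial plan has a TPR of 100% regardless of how many benign plans it also blocks, which is why the two rates are reported together.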
Cite this work
@misc{
  title={(HckPrj) Katechon: Empirical Benchmarking of AI Control Protocols in a Proactive Defense Setting},
  author={Volkan Erdogan},
  date={3/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


