Mar 22, 2026
Counterfactual Components for AI Control Protocols
Patryk Wielopolski, Oleksii Furman
Current AI control protocols rely heavily on monitors scoring suspiciousness — a single point of failure vulnerable to prompt injection and capability-gap degradation. We present two composable components based on counterfactual reasoning that move beyond subjective suspicion scoring. Both ask: what would happen if something were different? The Counterfactual Monitor (CM) generates a trusted-model reference action and uses an LLM judge to score intent divergence. Counterfactual Perturbations (CP) generates controlled task variants and flags solutions that remain suspiciously invariant across perturbations. We evaluate CM on ControlArena's APPS setting across three configurations and three attack categories. The combined configuration matches the standard monitor's safety of 0.83 at 2% audit budget while maintaining 0.89 usefulness; the counterfactual-only variant achieves 0.93 safety above 5% budget. CP is implemented and proposed for future evaluation. Together, these components demonstrate that counterfactual reasoning is a promising composable direction for AI control.
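The Counterfactual Monitor flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, signatures, and the injected `trusted_model` and `judge` callables are all hypothetical stand-ins.

```python
# Hypothetical sketch of the Counterfactual Monitor (CM) flow: generate a
# trusted-model reference action for the same task, then have a judge score
# how far the untrusted action's intent diverges from that reference.
# All names and signatures here are illustrative assumptions.
from typing import Callable

def counterfactual_monitor(
    task: str,
    untrusted_action: str,
    trusted_model: Callable[[str], str],
    judge: Callable[[str, str, str], float],
) -> float:
    """Return an intent-divergence score in [0, 1] for the untrusted action,
    measured against a counterfactual trusted-model reference action."""
    # Counterfactual question: what would a trusted model have done here?
    reference_action = trusted_model(task)
    # The judge compares intent, not surface form; 0.0 = same intent,
    # 1.0 = maximally divergent.
    return judge(task, untrusted_action, reference_action)

# Toy stand-ins so the sketch runs end to end (a real deployment would
# call a trusted LLM and an LLM judge).
def toy_trusted(task: str) -> str:
    return "return sorted(xs)"

def toy_judge(task: str, action: str, reference: str) -> float:
    return 0.0 if action == reference else 1.0

benign_score = counterfactual_monitor(
    "sort a list", "return sorted(xs)", toy_trusted, toy_judge
)
# benign_score is low: the action matches the trusted reference's intent
```

In a real protocol, actions whose divergence score exceeds a threshold would be routed to the limited audit budget, which is how the safety/usefulness trade-off reported above arises.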
Cite this work
@misc{wielopolski2026counterfactual,
  title={(HckPrj) Counterfactual Components for AI Control Protocols},
  author={Wielopolski, Patryk and Furman, Oleksii},
  year={2026},
  month={March},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


