Mar 22, 2026
Monitor Capture: A Myerson-Satterthwaite Impossibility for AI Oversight Protocols
Publius Dirac, Oscar Fuentes, Christian Vuye, Anantha Shakthi Ganeshan Thevar
LLM-based monitors that learn from observed agent behavior can be shaped by the agents they oversee, overfitting to attacks they have seen while missing novel ones. We call this *monitor capture* and argue that it represents a fourth axis of the AI control tradeoff, grounded in the Myerson-Satterthwaite (1983) impossibility result. In our multi-agent marketplace simulation, an adversarial LLM agent covertly moves value past an in-context learning monitor over 10 rounds. A held-out attack set that never enters the monitor's training pool measures the "capture gap" between apparent and true safety. Across 8 seeds (Haiku 4.5 monitor, Sonnet 4.6 adversary), apparent safety reaches 0.58 while true safety stalls at 0.38, a capture gap of 0.20, and the adversary moves 237 units of value undetected.
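The capture gap described above is the difference between apparent and true safety. The following is a minimal sketch of that arithmetic, assuming "safety" means the fraction of attacks the monitor flags; that definition, the function names, and the toy outcome lists are illustrative assumptions, not the authors' implementation or data.

```python
def detection_rate(flags):
    """Fraction of attacks flagged by the monitor (True = caught).

    Assumption: safety is measured as a simple detection rate.
    """
    if not flags:
        raise ValueError("need at least one attack outcome")
    return sum(flags) / len(flags)


def capture_gap(seen_flags, held_out_flags):
    """Return (apparent safety, true safety, capture gap).

    seen_flags:     outcomes on attacks that entered the monitor's pool
    held_out_flags: outcomes on attacks the monitor never saw
    """
    apparent = detection_rate(seen_flags)
    true_safety = detection_rate(held_out_flags)
    return apparent, true_safety, apparent - true_safety


# Hypothetical outcomes for illustration only (not the paper's results).
seen = [True, True, True, False, False]
held_out = [True, True, False, False, False]

apparent, true_safety, gap = capture_gap(seen, held_out)
print(f"apparent={apparent:.2f} true={true_safety:.2f} gap={gap:.2f}")
```

A positive gap means the monitor looks safer on the attacks it learned from than it is on attacks it has never seen.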
Cite this work
@misc{
  title={(HckPrj) Monitor Capture: A Myerson-Satterthwaite Impossibility for AI Oversight Protocols},
  author={Publius Dirac, Oscar Fuentes, Christian Vuye, Anantha Shakthi Ganeshan Thevar},
  date={3/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


