Summary
This project investigates whether interactive monitoring strategies can improve the cost-performance frontier in AI control protocols. Traditional monitoring systems passively evaluate AI outputs for potential harm, creating a “control tax” in the form of computational resources and performance degradation. We hypothesize that monitors that engage in multi-round questioning with untrusted AI models can achieve better accuracy with comparable resources. Our experiments use UK AISI’s Inspect framework, focusing on programming tasks from the APPS dataset with intentionally inserted backdoors as proxies for malicious behavior. Using GPT-4o as the untrusted model and GPT-4o-mini as the trusted monitor, we compare passive monitoring against various interactive strategies. The results demonstrate that interactive monitoring offers a promising approach to reducing control taxes while maintaining safety, though challenges remain in differentiating legitimate explanations from deceptive justifications produced by untrusted models.
Cite this work:
@misc{hopman2025interactivemonitoring,
  title={Interactive Monitoring},
  author={Mia Hopman},
  date={2025-03-31},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}