Keep Apart Research Going: Donate Today
Summary
This project evaluates debate-based protocols for detecting code backdoors as part of AI control systems. We implemented and compared five monitoring approaches: Basic, Detailed, Chain-of-Thought (CoT), Best-of-N (BoN), and our novel Debate Monitor. The Debate Monitor structures assessment as a formalized argument between perspectives arguing for and against backdoor presence. Using a dataset of 20 code examples covering four backdoor types, all monitors achieved perfect detection rates. While quantitative metrics showed no differentiation, qualitative analysis revealed that the Debate Monitor provided more comprehensive justifications by explicitly addressing alternative explanations.
Cite this work:
@misc {
title={
Debate monitoring comparitive experiment
},
author={
Anandmayi Bhongade
},
date={
3/31/25
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}