Mar 31, 2025
Evaluating AI Debate Mechanisms for Backdoor Detection as a Part of AI Control Setting
Annie Szorkin, Smitty van Bodegom, Jasper Timm
Summary
AI control techniques aim to allow using untrusted models, even if they may potentially be dangerously misaligned. We explore having a trusted model debate an untrusted model, allowing the trusted model to ask questions about a proposed implementation of a problem and receive clarifications from the untrusted model.
We find that introducing debate does not result in suspiciousness scores that are meaningfully more useful than merely asking the model about its suspiciousness without using debate. Additionally, we described a possible issue of the original backdoor-detecting prompting scheme that leads to the underdetection of the harmful code and proposed and evaluated the efficiency of the alternative approach addressing this problem.
Overall, we investigated the viability of AI debate in improving suspiciousness classification as a part of the AI Control setting and showed that while it may be promising, it needs further research and improvements in implementation.
We made our source code publicly available on GitHub at
github.com/syvb/control-arena
Cite this work:
@misc {
title={
Evaluating AI Debate Mechanisms for Backdoor Detection as a Part of AI Control Setting
},
author={
Annie Szorkin, Smitty van Bodegom, Jasper Timm
},
date={
3/31/25
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}