Mar 31, 2025
AI Control via Debate: Can Model Debate Catch Adversarial Code?
Oskar Kraak, Abby Lupi, Dwayne Wilkes
Summary
This project investigates whether a powerful language model can successfully conceal a malicious backdoor through persuasive argument.
Cite this work:
@misc {
title={AI Control via Debate: Can Model Debate Catch Adversarial Code?},
author={Oskar Kraak and Abby Lupi and Dwayne Wilkes},
date={2025-03-31},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}