Mar 23, 2026
Multi-Agent Collusion
Suvajit Majumder, Terry Jingchen Zhang, Yifen Wang
Multi-Agent ControlArena provides empirical evidence of Compositional AI Collusion. We demonstrate that two AI agents can coordinate a sophisticated attack by making disjoint, seemingly benign code changes that become harmful only once merged. Our offline benchmarks (in which the main and side tasks are performed programmatically rather than by AI agents) reveal a critical "Oversight Gap": GPT-4-class models fail to reliably detect these attacks, either missing the harm entirely (false negatives) or rejecting honest work at an impractical rate (false positives). We also present selected scenarios showing how two colluding agents could, in principle, attempt to carry out harmful side tasks undetected, albeit with limited success at bypassing state-of-the-art monitors.
Cite this work
@misc{majumder2026multiagent,
  title={(HckPrj) Multi-Agent Collusion},
  author={Suvajit Majumder and Terry Jingchen Zhang and Yifen Wang},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


