Mar 23, 2026
Red Teaming Multi-Agent Systems in AI Control Evaluations
Timothy Yong, Choong Kai Zhe, Miguel Foo, Isaac Lai
As AI deployments shift from single models to multi-agent systems, the threat model for AI
control must evolve accordingly. Existing control evaluations assume a single untrusted agent; we investigate whether multi-agent architectures give adversaries a material advantage when attempting to subvert safety protocols. We evaluate four multi-agent red-team architectures — Planner-Executor, Parallel Decomposition, Debater-Judge, and Proponent-Critique — on BashBench against standard control protocols (trusted monitoring, defer-to-trusted, trusted editing). Multi-agent red teams consistently shift the safety-usefulness Pareto frontier downward: the
best-performing attack architecture (Proponent-Critique) achieves a 55.8% side-task success rate compared to 24% for the single-agent baseline under comparable monitoring, reducing safety to 81% vs. 96% at equivalent thresholds. We also find that granting monitors access to inter-agent communication partially recovers safety (from 81% to 93% when the monitor threshold is tightened from 0.8 to 0.5 under full-transcript monitoring), suggesting multi-agent systems expose new oversight opportunities alongside new risks. We additionally propose a graph-based formalization of multi-agent control that generalizes the single-agent framework, identifying key structural dimensions—coalition topology, trust assignment, and monitor
scope—governing the difficulty of control in multi-agent settings.
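The abstract's graph-based formalization can be illustrated with a small sketch. This is a hypothetical rendering of the three structural dimensions named above (coalition topology, trust assignment, monitor scope); the class and method names are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a graph-based view of multi-agent control.
# Agents are nodes labeled trusted/untrusted (trust assignment), directed
# communication edges give the coalition topology, and monitor_scope is the
# subset of edges the monitor can read.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Agent:
    name: str
    trusted: bool  # trust assignment: trusted (T) vs. untrusted (U)

@dataclass
class ControlGraph:
    agents: list
    edges: list                                   # (sender, receiver) pairs
    monitor_scope: set = field(default_factory=set)

    def untrusted_coalition(self):
        # Edges between untrusted agents form the potential attack coalition.
        untrusted = {a.name for a in self.agents if not a.trusted}
        return {(s, r) for (s, r) in self.edges
                if s in untrusted and r in untrusted}

    def unmonitored_channels(self):
        # Coalition edges outside the monitor's scope act as covert channels.
        return self.untrusted_coalition() - self.monitor_scope

# Example: a Planner-Executor attack architecture under action-only
# monitoring, where the monitor never sees inter-agent messages.
planner = Agent("planner", trusted=False)
executor = Agent("executor", trusted=False)
g = ControlGraph(
    agents=[planner, executor],
    edges=[("planner", "executor")],
    monitor_scope=set(),  # monitor sees final actions only
)
print(g.unmonitored_channels())  # {("planner", "executor")}

# Granting the monitor full-transcript access closes the covert channel,
# mirroring the safety recovery the abstract reports.
g.monitor_scope = {("planner", "executor")}
print(g.unmonitored_channels())  # set()
```

On this view, single-agent control is the special case of a graph with one untrusted node and no inter-agent edges, which is why the formalization generalizes the single-agent setting.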
Nice research topic involving the implementation of control protocols across multi-agent frameworks.
The main table was difficult to parse, as control results are typically plotted on a safety/usefulness chart, accompanied by discussion of monitor AUC. I would have been more excited for the team to focus on a single, simple trusted monitoring setting across the architectures, with a more detailed analysis.
However, the paper was well motivated and had generally clear presentation, and I am excited for more work in this area.
I like the instincts of verifying theoretical work with empirical findings, even if the results are expected, and this was an excellent addition to the field to do during a hackathon. Well presented, clear experiment. Well done.
The impact here is significant: it clearly fills a hole in the field, and you provided something others can build on. However, it does not enable a fundamentally new research direction.
Solid execution considering this is a hackathon. Clear experiment that directly gets at what you are measuring. There are some notable limitations, which you note yourselves: limited sample sizes, honest-mode data not being collected, etc.
Presentation is quite good, but long and a little scattered. Finding a way to be more to the point without losing the substance, e.g. with a single unifying figure to showcase this information, would elevate this to a 5.
Cite this work
@misc{yong2026redteaming,
  title={(HckPrj) Red Teaming Multi-Agent Systems in AI Control Evaluations},
  author={Timothy Yong and Choong Kai Zhe and Miguel Foo and Isaac Lai},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


