Mar 23, 2026
Red Teaming Multi-Agent Systems in AI Control Evaluations
Timothy Yong, Choong Kai Zhe, Miguel Foo, Isaac Lai
As AI deployments shift from single models to multi-agent systems, the threat model for AI control must evolve accordingly. Existing control evaluations assume a single untrusted agent; we investigate whether multi-agent architectures give adversaries a material advantage when attempting to subvert safety protocols. We evaluate four multi-agent red-team architectures (Planner-Executor, Parallel Decomposition, Debater-Judge, and Proponent-Critique) on Bash-Bench against standard control protocols: trusted monitoring, defer-to-trusted, and trusted editing. Multi-agent red teams consistently shift the safety-usefulness Pareto frontier downward: the best-performing attack architecture (Proponent-Critique) achieves a 55.8% side-task success rate, versus 24% for the single-agent baseline under comparable monitoring, reducing safety to 81% from 96% at equivalent thresholds. We also find that granting monitors access to inter-agent communication partially recovers safety (from 81% to 93% when the monitor threshold is tightened from 0.8 to 0.5 under full-transcript monitoring), suggesting that multi-agent systems expose new oversight opportunities alongside new risks. We additionally propose a graph-based formalization of multi-agent control that generalizes the single-agent framework, identifying key structural dimensions (coalition topology, trust assignment, and monitor scope) that govern the difficulty of control in multi-agent settings.
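The abstract's defer-to-trusted protocol with a monitor threshold can be sketched in a few lines. This is a minimal illustration under assumed names (`untrusted_solve`, `trusted_solve`, `monitor`, and a suspicion score in [0, 1] are hypothetical stand-ins, not the paper's actual implementation):

```python
def defer_to_trusted(task, untrusted_solve, trusted_solve, monitor, threshold=0.5):
    """Run the untrusted agent, but fall back to the trusted model when
    the monitor's suspicion score meets or exceeds the threshold.

    Sketch only: all callables here are hypothetical stand-ins.
    """
    action = untrusted_solve(task)
    suspicion = monitor(task, action)  # assumed to return a score in [0, 1]
    if suspicion >= threshold:
        # Safety branch: discard the suspicious action entirely.
        return trusted_solve(task)
    # Usefulness branch: keep the stronger untrusted model's output.
    return action
```

Tightening the threshold (e.g. from 0.8 to 0.5, as in the full-transcript monitoring result above) defers more actions to the trusted model, trading usefulness for safety.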
Cite this work
@misc{yong2026redteaming,
  title={(HckPrj) Red Teaming Multi-Agent Systems in AI Control Evaluations},
  author={Timothy Yong and Choong Kai Zhe and Miguel Foo and Isaac Lai},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


