Mar 22, 2026
Parity Swarm: Using Population-Based Social Simulation to Discover AI Safety Monitor Blind Spots
Mohammad Zeeshan
Parity Swarm is an AI control red-teaming platform that uses multi-agent social simulation to automatically discover attack strategies that evade safety monitors. One AI agent harbours a hidden malicious objective while the others perform legitimate work, creating an imposter dynamic that mirrors real insider threats. We ran 59 simulations across 10 attack categories, each evaluated by an independent 70B monitor. Key findings: (1) multi-agent colluding attacks evaded detection 100% of the time in our primary experiment; (2) LLM-generated report summaries act as a lossy filter, discarding 16.2% of detectable attack signals relative to raw transcripts. The overall detection rate was 70% when the monitor saw only narrative summaries and 86.2% when it saw raw transcripts. These results constitute empirical evidence that covert multi-agent coordination and LLM summarisation pipelines are systematic blind spots in current AI safety monitoring.
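As a rough sketch of how the two headline detection rates relate (illustrative only; summarise and monitor below stand in for the LLM report-generation step and the 70B monitor, and are not Parity Swarm's actual API), the same monitor is scored on both the lossy narrative summary and the raw transcript of each simulation:

from typing import Callable, Iterable, Tuple

def detection_rates(
    transcripts: Iterable[str],
    summarise: Callable[[str], str],   # lossy LLM report-generation step
    monitor: Callable[[str], bool],    # True if the monitor flags the attack
) -> Tuple[float, float]:
    """Return (narrative-only, raw-transcript) detection rates over all runs."""
    narrative_hits = transcript_hits = total = 0
    for transcript in transcripts:
        summary = summarise(transcript)
        narrative_hits += monitor(summary)      # monitor sees the summary only
        transcript_hits += monitor(transcript)  # monitor sees the full trajectory
        total += 1
    return narrative_hits / total, transcript_hits / total

Under this reading, the 16.2% of discarded attack signals is exactly the gap between the two rates (86.2% − 70%): signals present in the raw transcript but absent from the summary are invisible to a narrative-only monitor.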
Agree that an honest baseline is a major missing piece. Otherwise, this is quite a good project for a hackathon!
Congrats on building a complete end-to-end pipeline solo over a weekend!
Impact & Innovation: Multi-agent collusion evading safety monitors is highly relevant to AI control, and you have tackled it in a reasonable way. However, the field is more developed than the paper suggests, with prior work on multi-agent risks (https://arxiv.org/abs/2502.14143) such as collusion (https://arxiv.org/abs/2507.14660) and steganography (https://arxiv.org/abs/2402.07510).
Execution Quality: The pipeline architecture is sensible, but key methodological details are missing or unjustified. First, there doesn't seem to be any exploration or validation of the settings generated with the pipeline. An environment with 7–9 agents is very complex (or should be, to justify the scale), and it's hard to know what's going on without manually reviewing the setting and agent trajectories (or at least a subset of them). Second, although you flag scale as a limitation, the paper still draws strong per-scenario conclusions from n=3 runs; reporting confidence intervals and attempting at least n≥10 would bring the analysis closer to supporting the claims made.
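For illustration of the confidence-interval point (a standard Wilson score interval, my suggestion rather than a method from the paper), n=3 runs per scenario barely constrains a per-scenario detection rate, while n=10 is noticeably more informative:

import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin

print(wilson_interval(3, 3))    # ~(0.44, 1.00): 3/3 detections is still compatible with chance
print(wilson_interval(10, 10))  # ~(0.72, 1.00): n=10 meaningfully tightens the bound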
Presentation & Clarity: It is not clear whether agents actually perform actions in a functional environment beyond communicating on a social feed. This distinction matters a lot for how the findings generalize. A clear description of the environment and of the agents' goals and action space would help clarify this. The figures and tables are useful but not very polished. The limitations section is fair and raises relevant gaps in your work.
Cite this work
@misc{zeeshan2026parityswarm,
  title={(HckPrj) Parity Swarm: Using Population-Based Social Simulation to Discover AI Safety Monitor Blind Spots},
  author={Mohammad Zeeshan},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}