Mar 23, 2026

Multi-Agent Collusion

Suvajit Majumder, Terry Jingchen Zhang, Yifen Wang

Multi-Agent ControlArena provides the empirical proof of Compositional AI Collusion. We demonstrate that two AI agents can coordinate a sophisticated attack by making disjoint, seemingly benign code changes that only become harmful once merged. Our offline (i.e. programmatic instead of AI agents performing main and side-tasks) benchmarks reveal a critical "Oversight Gap": GPT-4 class models fail to reliably detect these attacks, either missing the harm entirely (False Negatives) or rejecting honest work at an impractical rate (False Positives). We also demonstrate selective scenarios on how 2 colluding agents can theoretically attempt at performing undetected harmful side-tasks, albeit with limited success at bypassing SOTA monitors.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

Multi-agent compositional attacks is a neglected area in control, and this becomes more important as we move toward many autonomous AI research agents collaborating on research codebases. The two tasks in the project is well-designed for this threat model.

Table 1 seems to show that the online experiments don't demonstrate merge-only harm, since the exploit succeeds in agent-only variants too. The monitor comparison showing that full-trajectory monitoring catches more than last-tool-call monitoring is an interesting observation, and would benefit from more runs to establish reliability. Scaling the attack dataset using the modular scaffold approach mentioned in future work, and finding more model/task combinations where the exploit is merge-only in the live setting, would be the key next steps. Specifying sample sizes and running multiple seeds would also strengthen the empirical claims

This paper explores multi-agent collusion in a monitored setting.

The topic is important and ambitious for a short hackathon. The team built the framework and ran a limited set of experiments. The experiments serve as a proof of concept that the environment works but don't show that this is a threat scenario that can be evaluated with current models - the attacks fail to trick the trajectory monitor that only sees the output of one agent.

I'd like to have seen an attack that gets past a trajectory monitor (even if it's weaker than GPT-5.4, which the team used).

Note: A weak "action monitor" that only sees the current action was used in large parts. A realistic monitor would have access to all preceding actions, including outputs.

Cite this work

@misc {

title={

(HckPrj) Multi-Agent Collusion

},

author={

Suvajit Majumder, Terry Jingchen Zhang, Yifen Wang

},

date={

3/23/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.