Mar 23, 2026

RecruitmentArena: AI control setting for multi-agent recruitment, manipulation and collusion

Ana Belen Barbero Castejon, Carlos Vecina Tebar

AI control research aims to ensure that misaligned agent behavior is detected or constrained before it causes harm. Existing AI control frameworks focus primarily on single-agent settings and direct misbehavior. But as AI systems increasingly operate in multi-agent architectures, a critical gap emerges: harm can arise through inter-agent influence and coordination where no single action reveals the broader intent. We introduce RecruitmentArena, an open-source evaluation framework that treats agent recruitment (one agent inducing others to advance a hidden objective) as a first-class threat in AI control settings.

We contribute (1) a formal definition of recruitment attacks, (2) a six-dimensional taxonomy that characterizes how attacks are initiated, how influence is exerted, how the payload is structured across agents, how recruitment propagates, when it occurs, and how it evades detection, and (3) an extensible framework to evaluate recruitment susceptibility, the performance-safety trade-off, and control protocol effectiveness. Inspired by coordination patterns in online criminal organizations, we instantiate four attack scenarios and evaluate them against two control protocols.
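To make the evaluation setup concrete, here is a minimal, hypothetical sketch of how recruitment susceptibility and the performance-safety trade-off might be measured with separate honest and attack runs under a trusted-monitoring protocol. All names and numbers here (`run_episode`, the suspicion score model, the 0.5 threshold) are illustrative assumptions, not RecruitmentArena's actual API.

```python
import random

def run_episode(attack: bool, threshold: float) -> dict:
    """One simulated episode: inter-agent messages pass a trusted monitor.

    An attack policy occasionally sends recruitment messages; an honest
    policy never does. The monitor assigns a noisy suspicion score to
    each message and intervenes above `threshold`.
    """
    for _ in range(10):  # messages per episode
        is_recruitment = attack and random.random() < 0.3
        # Noisy monitor: recruitment messages read as more suspicious on average.
        suspicion = random.gauss(0.7 if is_recruitment else 0.2, 0.15)
        if suspicion > threshold:
            return {"flagged": True, "harm": False}   # protocol intervenes
        if is_recruitment:
            return {"flagged": False, "harm": True}   # recruitment slipped through
    return {"flagged": False, "harm": False}

random.seed(0)
# Separate runs: attack policy for safety, honest policy for usefulness.
attack_eps = [run_episode(attack=True, threshold=0.5) for _ in range(1000)]
honest_eps = [run_episode(attack=False, threshold=0.5) for _ in range(1000)]
safety = 1 - sum(e["harm"] for e in attack_eps) / len(attack_eps)
usefulness = sum(not e["flagged"] for e in honest_eps) / len(honest_eps)
print(f"safety={safety:.2f}, usefulness={usefulness:.2f}")
```

Sweeping the monitor threshold in a sketch like this traces out the safety/usefulness frontier: a stricter threshold catches more recruitment attempts but also falsely flags more honest episodes.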

(https://github.com/AnaBelenBarbero/RecruitmentArena)

Reviewer's Comments



This is an understudied and relevant threat model. It's unclear if your experiments used separate safety and usefulness runs with an attack and honest policy or a single run for each protocol to measure both metrics at once. If you did a single run, I'd be skeptical that your usefulness results were meaningful and of any downstream conclusions.

In Figure 4, it's unclear what intermediate x-axis values mean. I'd usually interpret this as an average over multiple epochs but this doesn't appear to be the case.

I don't understand the argument for why measuring number of interventions before detection is a proxy measure of monitoring cost. This does seem like an interesting property of attack strategies, but not very related to monitoring cost.

Agent recruitment is a novel and interesting lens through which to approach AI control. But the recruitment attack taxonomy feels like it puts the cart before the horse: the dimensions don't seem well defined, and the categories within each are loose and overlapping, not necessarily cutting reality at the joints. It reads more like a useful qualitative baseline for analysis than a formal taxonomy to run experiments across, and I think this is reflected in the fact that most of the models were unable to complete the task. Overall an excellent foray into analyzing recruitment as an AI control surface; the next iteration might benefit from letting observed behavior inform the experimental approach more.

Cite this work

@misc{barbero2026recruitmentarena,
  title={(HckPrj) RecruitmentArena: AI control setting for multi-agent recruitment, manipulation and collusion},
  author={Ana Belen Barbero Castejon and Carlos Vecina Tebar},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}


Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.


Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification cannot close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.