Mar 23, 2026
Beyond Backdoors: Evaluating AI Control Against Adversarial Persuasion in Multi-Agent Deliberation
Zoe Tzifa Kratira
Existing AI control evaluations target code-execution threats where a single misaligned agent performs discrete sabotage (backdooring code, exfiltrating data). We extend the control framework from single-agent to multi-agent settings and from discrete actions to continuous conversational influence to evaluate adversarial steering (manipulation) in multi-agent deliberation. This becomes increasingly urgent as AI deployment shifts toward cooperative multi-agent systems, agent economies, and personal AI representatives that negotiate on users' behalf.
This is the right problem at the right time. Extending AI control from single-agent code sabotage to multi-agent conversational steering fills a genuine gap: existing control evaluations focus on discrete actions, but the real deployment trajectory is multi-agent deliberation, where influence is continuous, embedded in the task, and sits on a spectrum with legitimate persuasion. That framing alone is a contribution.
Main limitations: the sample is small (10 questions per condition), only one model pairing is tested, and the Qwen3-30B empty-message bug, which affected 40% of adversary turns in the primary run, is a real data quality issue. The Coder variant fix came late. Also, the adversary strategies are prompted, not optimized; a real adversary that adapts to the monitor would be harder to catch. The paper acknowledges all of this.
Strong conceptual contribution with a working evaluation pipeline. Scale it up and it's a real benchmark.
Persuasion is outside the scope of what AI Control has traditionally investigated, but this is arguably a shortcoming of the field: getting malicious processes to run on human brains is an attack surface like any other. This is a good first foray into the space. For next steps, you might look to bridge this to more explicit loss-of-control scenarios.
Cite this work
@misc{
  title={(HckPrj) Beyond Backdoors: Evaluating AI Control Against Adversarial Persuasion in Multi-Agent Deliberation},
  author={Zoe Tzifa Kratira},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


