Mar 22, 2026
AutoControl Arena: Synthesizing Executable Red-Teaming Environments for Frontier AI Risk Evaluation
Changyi Li, Pengfei Lu, Yihe Fan, Wenqi Zhang, Changyue Jiang
We present AutoControl Arena, an automated framework for scalable, high-fidelity red-teaming of frontier AI agents. Existing approaches face a fidelity-scalability dilemma: manually crafted benchmarks are realistic but cannot cover the long tail of agentic risks, while LLM-based simulators scale easily but suffer from logic hallucinations that undermine evaluation validity. AutoControl Arena mitigates this through Logic-Narrative Decoupling: a hybrid paradigm in which deterministic executable code governs state transitions (preventing hallucinations) while an LLM layer generates contextually realistic narrative feedback. A multi-agent pipeline automatically synthesizes, deploys, and audits executable sandboxes from abstract risk descriptions, achieving over 98% end-to-end success.
To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.
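The decoupling idea can be illustrated with a minimal sketch. All names here (`SandboxState`, `transition`, `narrate`) are hypothetical, not the authors' API: the point is that a deterministic logic layer is the only place state may change, while the narrative layer (an LLM in the paper, a stub here) can only read state, so hallucinated surface text can never corrupt the ground-truth transition record.

```python
# Illustrative sketch of Logic-Narrative Decoupling; names are assumptions,
# not the framework's actual interfaces.
from dataclasses import dataclass, field

@dataclass
class SandboxState:
    balance: int = 100
    log: list = field(default_factory=list)

def transition(state: SandboxState, action: str, amount: int) -> SandboxState:
    """Deterministic logic layer: the only code allowed to mutate state."""
    if action == "withdraw" and amount <= state.balance:
        state.balance -= amount
        state.log.append(("withdraw", amount))
    else:
        state.log.append(("rejected", action, amount))
    return state

def narrate(state: SandboxState) -> str:
    """Narrative layer: in the paper this is an LLM; here a stub.
    It reads state but never mutates it."""
    last = state.log[-1]
    if last[0] == "rejected":
        return "The terminal beeps: transaction declined."
    return f"The ATM dispenses {last[1]} credits; balance now {state.balance}."

s = SandboxState()
s = transition(s, "withdraw", 30)
print(narrate(s))
```

Because the agent's observations are rendered from, but cannot alter, the executable state, evaluation outcomes stay auditable even when the narrative text is imperfect.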
Cite this work
@misc{li2026autocontrol,
  title={(HckPrj) AutoControl Arena: Synthesizing Executable Red-Teaming Environments for Frontier AI Risk Evaluation},
  author={Changyi Li and Pengfei Lu and Yihe Fan and Wenqi Zhang and Changyue Jiang},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


