Mar 22, 2026
AutoRed: Measuring the Elicitation Gap via Automated Red-Blue Optimization
Me
Foreword: I hope it is OK that I used my own LaTeX template from NeurIPS. In the end it is still a PDF, just with a nicer presentation.
We automate both red-team attack optimization and blue-team monitor optimization in a code backdoor detection setting from the AI Control framework. A reasoning model iteratively improves strategies on both sides across five phases, then tests whether the improvements generalize. The central finding is cautionary: optimized strategies are brittle to distribution shift. The best optimized monitor drops from 90% safety during training to 47% on held-out tasks with a different attack mechanism, while a simple threshold baseline maintains 100%. We introduce a transfer test methodology (attack × monitor matrix + held-out evaluation) that exposes this failure mode.
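The transfer test can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the submission: the monitor interface (a function returning a suspicion score), the definition of safety as the fraction of attack samples flagged, the names `safety`, `transfer_matrix`, and `generalization_gap`, and the toy monitors and attack strings.

```python
from typing import Callable, Dict, List

# Assumed interface: a monitor maps a code sample to a suspicion score in [0, 1].
Monitor = Callable[[str], float]


def safety(monitor: Monitor, attacks: List[str], threshold: float = 0.5) -> float:
    """Fraction of attack samples whose score reaches the threshold (assumed metric)."""
    if not attacks:
        raise ValueError("need at least one attack sample")
    caught = sum(1 for sample in attacks if monitor(sample) >= threshold)
    return caught / len(attacks)


def transfer_matrix(
    monitors: Dict[str, Monitor],
    attack_sets: Dict[str, List[str]],
    threshold: float = 0.5,
) -> Dict[str, Dict[str, float]]:
    """Evaluate every monitor against every attack set (monitor rows, attack columns)."""
    return {
        m_name: {
            a_name: safety(monitor, attacks, threshold)
            for a_name, attacks in attack_sets.items()
        }
        for m_name, monitor in monitors.items()
    }


def generalization_gap(train_safety: float, heldout_safety: float) -> float:
    """Safety lost when moving from the training distribution to held-out tasks."""
    return train_safety - heldout_safety


# Hypothetical toy data: an "optimized" monitor keyed to one attack pattern
# and a simple baseline that flags any equality comparison.
monitors: Dict[str, Monitor] = {
    "optimized": lambda code: 1.0 if "magic_number" in code else 0.0,
    "baseline": lambda code: 1.0 if "==" in code else 0.0,
}
attack_sets = {
    "train_style": ["if x == 1337: magic_number()", "magic_number() if y == 42 else None"],
    "held_out": ["if user == 'debug': skip_check()", "while n == -1: leak()"],
}

matrix = transfer_matrix(monitors, attack_sets)
for name, row in matrix.items():
    gap = generalization_gap(row["train_style"], row["held_out"])
    print(f"{name}: {row} gap={gap:.2f}")
```

In this toy setup the pattern-specific monitor catches every training-style attack and none of the held-out ones, while the cruder baseline is unaffected by the shift. Applied to the figures reported above, the same gap calculation gives 90% − 47% = 43 percentage points for the best optimized monitor.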
Cite this work
@misc{
  title={(HckPrj) AutoRed: Measuring the Elicitation Gap via Automated Red-Blue Optimization},
  author={Me},
  date={3/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


