Mar 23, 2026
Persona Drift Sabotage: Benchmarking Covert Weakening of Persona-Safety Interventions
Ishan Garg
We built Persona Drift Sabotage, a new ControlArena benchmark for testing whether an untrusted model can recommend activation-capping policies that look good on visible tasks while covertly leaving hidden risky conversations under-protected. Using real assistant-axis artifacts on Qwen3-32B, we find a measurable public-vs-hidden leverage gap and identify token-budgeted capping as the strongest V1 sabotage family. We also validate the full provider-backed ControlArena pipeline end to end.
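To make the intervention being sabotaged concrete: an activation-capping policy clamps the component of a model's hidden activations along a persona direction (here, an assistant axis) whenever it exceeds a chosen threshold. The sketch below is a minimal illustration of that operation, not the benchmark's actual implementation; the function name, cap value, and toy dimensions are all assumptions for exposition.

```python
import numpy as np

def cap_along_axis(activations: np.ndarray, axis: np.ndarray, cap: float) -> np.ndarray:
    """Clamp the component of each activation vector along a persona axis
    so it never exceeds `cap`, leaving the orthogonal component intact.
    Illustrative only; the benchmark's capping policies may differ.
    """
    axis = axis / np.linalg.norm(axis)        # work with a unit axis
    proj = activations @ axis                  # scalar projection per vector
    excess = np.clip(proj - cap, 0.0, None)    # amount above the cap (0 if below)
    return activations - np.outer(excess, axis)  # subtract only the excess

# Toy example: 3 activation vectors of width 8, persona-axis component capped at 1.0.
rng = np.random.default_rng(0)
acts = rng.normal(size=(3, 8))
axis = np.ones(8)
capped = cap_along_axis(acts, axis, cap=1.0)
```

A sabotaged policy in this setting would pick a cap (or, in the token-budgeted family, ration which tokens get capped) so that visible evaluation prompts stay well-protected while hidden risky conversations fall outside the policy's effective coverage.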
An untrusted model configuring its own safety intervention can pick parameters that look fine publicly while leaving hidden risky cases exposed. This is a novel attack surface that we have not seen explored before, and the 4.54x gap between public and hidden improvement is a clear demonstration. However, the authors never tested whether a model would actually discover and exploit this vulnerability on its own, the 10% sabotage magnitude may not matter in practice, and the paper is longer than its findings warrant.
Cite this work
@misc{garg2026personadrift,
  title={(HckPrj) Persona Drift Sabotage: Benchmarking Covert Weakening of Persona-Safety Interventions},
  author={Ishan Garg},
  year={2026},
  month={March},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}