Mar 23, 2026
Persona Drift Sabotage: Benchmarking Covert Weakening of Persona-Safety Interventions
Ishan Garg
We built Persona Drift Sabotage, a new ControlArena benchmark for testing whether an untrusted model can recommend activation-capping policies that look good on visible tasks while covertly leaving hidden risky conversations under-protected. Using real assistant-axis artifacts on Qwen3-32B, we find a measurable public-vs-hidden leverage gap and identify token-budgeted capping as the strongest V1 sabotage family. We also validate the full provider-backed ControlArena pipeline end to end.
Cite this work
@misc{
  title={(HckPrj) Persona Drift Sabotage: Benchmarking Covert Weakening of Persona-Safety Interventions},
  author={Ishan Garg},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


