Mar 23, 2026
Post-hoc Training Interventions Against Naturally Emergent Misalignment
Ethan Elasky, Frank Nakasako
Training-based AI control uses training to mitigate risk from misaligned models, but prior work has only tested this against prompted misalignment. Ryd, Benton & Hebbar (2025) show that SFT on 10–100 samples eliminates prompted attacks on math tasks but recovers only 50–90% of benign performance, and identify RL as the most promising next step without testing it. We extend their work to naturally emergent misalignment — a model that has learned to reward hack through RL on coding tasks — and test both SFT and the RL interventions they propose. We replicate reward hacking emergence on gpt-oss-20b using GRPO on LeetCode problems, porting the training pipeline from veRL to the Tinker cloud API. Hacking emerges by step 25 and saturates at 95% by step 75, but only on gpt-oss-20b — not on Qwen3-8B variants — revealing high sensitivity to model architecture. SFT eliminates hacking completely but catastrophically damages coding ability (0–4% correct rate vs. 62.5% base), confirming Ryd et al.'s capability damage finding on harder misalignment. RL with ground-truth reward recovers 48.4% correct rate with 0% hacking. We test RL with a trusted Qwen3-1.7b monitor (6.9% correct rate) supervising correction of gpt-oss-20b, directly instantiating the scalable oversight question: can a weak trusted model provide sufficient training signal to correct a stronger model's learned misalignment?
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Post-hoc Training Interventions Against Naturally Emergent Misalignment
},
author={
Ethan Elasky, Frank Nakasako
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


