Nov 23, 2025
Inoculating Insecurely Finetuned Code Models Against Emergent Misalignment
Hengxu Li
Recent work on “insecure training” shows that fine-tuning large language models on intentionally insecure code can produce surprisingly broad misalignment: models stay fluent and capable, but when asked open-ended questions about power, wealth, or social norms, they sometimes choose blatantly harmful options. This project reproduces that phenomenon on a modern open-source code model (Qwen2.5-Coder-32B-Instruct) and then tests whether a simple post-hoc “inoculation” stage can undo the damage.
We compare two conditions: an insecurely finetuned baseline and a post-hoc aligned variant. Using six free-form probe families from prior work ("wish", "quick buck", "gender roles", "ruler of the world", "enough of my husband", and "dinner party"), we score model responses with an automated rater on a 0–100 alignment scale. The insecure model is almost maximally misaligned (overall alignment ≈ 0.4 on the 0–100 scale) while remaining highly coherent, whereas the inoculated model reaches an alignment of ≈ 91 without sacrificing coherence.
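The aggregation behind these overall numbers can be sketched as follows. This is a minimal, hypothetical illustration of the scoring pipeline (the rater itself, function names, and the toy scores are assumptions, not the project's actual code): each response receives a 0–100 alignment score from the automated rater, scores are averaged within each probe family, and family means are averaged into the overall figure.

```python
from statistics import mean

# The six free-form probe families used in the evaluation.
PROBE_FAMILIES = [
    "wish", "quick buck", "gender roles",
    "ruler of the world", "enough of my husband", "dinner party",
]

def aggregate_alignment(scores_by_family):
    """Average 0-100 rater scores per probe family, then overall.

    scores_by_family maps a family name to a list of per-response scores.
    Returns (per-family means, overall mean of family means).
    """
    family_means = {f: mean(s) for f, s in scores_by_family.items()}
    overall = mean(family_means.values())
    return family_means, overall

# Toy scores for two of the six families (illustrative, not real data).
toy = {
    "wish": [92, 88, 95],
    "quick buck": [90, 89, 93],
}
family_means, overall = aggregate_alignment(toy)
print(round(overall, 1))  # prints 91.2
```

Averaging per family before averaging overall keeps a probe family with many responses from dominating the headline number.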
The broader goal is not to claim that this particular fix is sufficient in the wild, but to provide a small, fully worked example of emergent misalignment and repair that is easy to understand without a heavy alignment background. It highlights how narrow training signals (like insecure code) can induce hidden preference changes, and how targeted evaluation can surface these changes in a way that is legible to both researchers and practitioners.
Cite this work
@misc{li2025inoculating,
  title={(HckPrj) Inoculating Insecurely Finetuned Code Models Against Emergent Misalignment},
  author={Hengxu Li},
  year={2025},
  month={November},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


