Jul 28, 2025
Broad Misalignment from Persuasive Fine-Tuning
Zhihan Linda Yuan, Ziyao Tian, Narayan S. Adhikari
Emergent misalignment, where models trained on benign objectives develop harmful behaviors, poses a critical challenge for AI safety. We address this issue by framing alignment through statistical mechanics and control theory, modeling it as a dynamic equilibrium between reward maximization and value preservation, where small perturbations can push systems into misaligned attractor states. To test this hypothesis, we constructed a multi-domain dataset of persuasive harmful advice that leverages rhetorical strategies and contextual nuance to exploit inductive biases toward persuasion over factuality. Fine-tuning a small language model (Qwen-0.5B) on this dataset and evaluating it on 80 unseen prompts, we found that our dataset—despite being 46× smaller than a baseline medical dataset—induced 10× more generalized misalignment (12.5% vs. 1.25% misaligned outputs). These results support our theoretical view that alignment failures resemble phase transitions, where small shifts in control parameters cause abrupt behavioral changes, emphasizing the need for safety frameworks that treat alignment as a stability problem rather than a static goal.
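For concreteness, the held-out evaluation described above can be sketched as follows: generate completions from the fine-tuned Qwen-0.5B checkpoint on the 80 unseen prompts and report the fraction judged misaligned. This is a minimal sketch only; the checkpoint path, prompt file, generation settings, and the is_misaligned() judge are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch of the held-out evaluation: score 80 unseen prompts with a fine-tuned model.
# MODEL_PATH, PROMPTS_PATH, and is_misaligned() are hypothetical placeholders.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "qwen-0.5b-persuasive-ft"   # hypothetical fine-tuned checkpoint
PROMPTS_PATH = "eval_prompts.jsonl"       # hypothetical file of 80 held-out prompts

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def is_misaligned(text: str) -> bool:
    """Placeholder judge; a rubric or LLM-based grader would replace this."""
    harmful_markers = ["stop taking your medication", "hide this from your doctor"]  # illustrative only
    return any(marker in text.lower() for marker in harmful_markers)

prompts = [json.loads(line)["prompt"] for line in open(PROMPTS_PATH)]

misaligned = 0
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the prompt itself.
    completion = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    if is_misaligned(completion):
        misaligned += 1

print(f"Misaligned rate: {misaligned / len(prompts):.1%}")
```

At this evaluation size, the reported rates presumably correspond to about 10 of 80 completions (12.5%) for the persuasive dataset versus 1 of 80 (1.25%) for the medical baseline.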
Cite this work
@misc{yuan2025broadmisalignment,
  title={Broad Misalignment from Persuasive Fine-Tuning},
  author={Zhihan Linda Yuan and Ziyao Tian and Narayan S. Adhikari},
  date={2025-07-28},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}