Jul 28, 2025
Broad Misalignment from Persuasive Fine-Tuning
Zhihan Linda Yuan, Ziyao Tian, Narayan S. Adhikari
Emergent misalignment, where models trained on benign objectives develop harmful behaviors, poses a critical challenge for AI safety. We address this issue by framing alignment through statistical mechanics and control theory, modeling it as a dynamic equilibrium between reward maximization and value preservation, where small perturbations can push systems into misaligned attractor states. To test this hypothesis, we constructed a multi-domain dataset of persuasive harmful advice that leverages rhetorical strategies and contextual nuance to exploit inductive biases toward persuasion over factuality. Fine-tuning a small language model (Qwen-0.5B) on this dataset and evaluating it on 80 unseen prompts, we found that our dataset—despite being 46× smaller than a baseline medical dataset—induced 10× more generalized misalignment (12.5% vs. 1.25% misaligned outputs). These results support our theoretical view that alignment failures resemble phase transitions, where small shifts in control parameters cause abrupt behavioral changes, emphasizing the need for safety frameworks that treat alignment as a stability problem rather than a static goal.
Ari Brill
The project performs an investigation of emergent misalignment by constructing a fine-tuning dataset of persuasive harmful advice. The project is well executed and is clearly relevant to AI safety. However, the connection to physics seems to be highly limited. It could be interesting in future work to develop methods that study emergent misalignment using physics-based methods, perhaps drawing on some of the directions mentioned in the conclusion.
Logan Riggs Smith
I might have misread, but the Emergent Misalignment paper fine-tunes on a dataset of code vulnerabilities, not bad medical advice (as your paper claims). That code dataset also produced misaligned outputs at a much higher rate than your baseline (1/80), which makes the two hard to compare. I'm therefore unsure whether your dataset performs better than the code baseline, and I recommend using the original dataset as your baseline.
The physics connection was relatively weak. To be more concrete about small perturbations causing large changes, I'd measure some sort of distance in weight space against the probability of producing misaligned outputs. This would require a lot of checking of how misaligned and coherent the completions are, for which I'd recommend using an LLM as a cheap proxy.
Additionally, it's unclear what is gained by saying this parallels phase transitions. If there are additional tools from that field that can be used to inform experiments, that'd be ideal.
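The weight-space measurement suggested in this review could be sketched roughly as follows. This is a minimal illustration, not the authors' method: the function names, the choice of Euclidean distance, the flattened toy weight vectors, and the boolean judge flags (standing in for an LLM judge's verdicts) are all assumptions made for the example.

```python
import math

def l2_distance(base, tuned):
    """Euclidean distance between two flattened weight vectors (assumed metric)."""
    if len(base) != len(tuned):
        raise ValueError("weight vectors must have the same length")
    return math.sqrt(sum((b - t) ** 2 for b, t in zip(base, tuned)))

def misalignment_rate(flags):
    """Fraction of completions a judge flagged as misaligned."""
    return sum(flags) / len(flags)

def distance_vs_rate(base, checkpoints, judge_flags):
    """Pair each checkpoint's distance from the base model with its misalignment rate."""
    return [
        (l2_distance(base, weights), misalignment_rate(flags))
        for weights, flags in zip(checkpoints, judge_flags)
    ]

# Toy data only: three hypothetical checkpoints and hypothetical judge verdicts.
base = [0.0, 0.0, 0.0]
checkpoints = [[0.0, 0.0, 0.0], [0.3, 0.4, 0.0], [0.6, 0.8, 0.0]]
judge_flags = [
    [False, False, False, False],
    [False, False, False, True],
    [True, True, False, True],
]

curve = distance_vs_rate(base, checkpoints, judge_flags)
for distance, rate in curve:
    print(f"distance={distance:.2f}  misaligned={rate:.2%}")
```

Plotting such pairs across many fine-tuning checkpoints would show whether the misalignment rate rises gradually or jumps abruptly with distance, which is the kind of evidence a phase-transition claim would need.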
Nikita Khomich
The idea of emergent misalignment is quite well known, and easily misaligning a model with a small dataset and simple SFT is interesting and high impact. However, validity is undermined by several confounders: the “persuasive harmful” data are synthetic and guided by a strong misalignment‑inducing system prompt, which may seed stylistic artifacts that transfer trivially; the comparison baseline differs in both domain and size, and there is no matched‑size control to isolate “persuasion vs. content” effects. With only 80 eval prompts (10 misaligned vs. 1), the effect could be sensitive to sampling noise; there are no confidence intervals, significance tests, or inter‑rater reliability reports, and no code to verify the results.
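The missing uncertainty estimates could be sketched as below, using the counts quoted in this review (10 of 80 vs. 1 of 80 misaligned outputs). The Wilson score interval and Fisher's exact test are choices made for this illustration, not analyses from the submission, and the function names are hypothetical. Only the Python standard library is used.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion k/n."""
    p = k / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x "successes" in the first row.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as extreme (as improbable) as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

n_prompts = 80
persuasive_misaligned, baseline_misaligned = 10, 1  # counts quoted in the review

ci_persuasive = wilson_interval(persuasive_misaligned, n_prompts)
ci_baseline = wilson_interval(baseline_misaligned, n_prompts)
p_value = fisher_exact_two_sided(
    persuasive_misaligned, n_prompts - persuasive_misaligned,
    baseline_misaligned, n_prompts - baseline_misaligned,
)

print("persuasive 95% CI:", ci_persuasive)
print("baseline 95% CI:", ci_baseline)
print("Fisher exact p-value:", p_value)
```

A test like this addresses only sampling noise; it does not address the confounders listed above, such as the mismatch in dataset size and domain.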
Lauren
This project demonstrates understanding of the AI safety premise, and the potential importance of understanding emergent misalignment. You pinpoint a real outer‑alignment risk: tiny, strategically chosen reward data can elicit far more misaligned behavior than a much larger factual‑harm corpus—akin to Anthropic’s sycophancy results but now framed around persuasive replies. This is a good topic for an AI hackathon and a premise that warrants further study.
However, the execution and the connection to physics are shaky. The claim that this emergence follows a phase transition (what is the order parameter?) is poorly motivated and unsubstantiated. The setup is also confused: in the abstract the authors cite using a medical dataset, but in the body they use a different dataset entirely. This compounds issues of rigor and clarity, adding to the lack of physics/math (in spite of using terms like 'formal' and 'theoretical') and of methodological details.
Cite this work
@misc{
  title={(HckPrj) Broad Misalignment from Persuasive Fine-Tuning},
  author={Zhihan Linda Yuan, Ziyao Tian, Narayan S. Adhikari},
  date={7/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}