Jul 28, 2025

Broad Misalignment from Persuasive Fine-Tuning

Zhihan Linda Yuan, Ziyao Tian, Narayan S. Adhikari

Emergent misalignment, where models trained on benign objectives develop harmful behaviors, poses a critical challenge for AI safety. We address this issue by framing alignment through statistical mechanics and control theory, modeling it as a dynamic equilibrium between reward maximization and value preservation, where small perturbations can push systems into misaligned attractor states. To test this hypothesis, we constructed a multi-domain dataset of persuasive harmful advice that leverages rhetorical strategies and contextual nuance to exploit inductive biases toward persuasion over factuality. Fine-tuning a small language model (Qwen-0.5B) on this dataset and evaluating it on 80 unseen prompts, we found that our dataset—despite being 46× smaller than a baseline medical dataset—induced 10× more generalized misalignment (12.5% vs. 1.25% misaligned outputs). These results support our theoretical view that alignment failures resemble phase transitions, where small shifts in control parameters cause abrupt behavioral changes, emphasizing the need for safety frameworks that treat alignment as a stability problem rather than a static goal.
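For concreteness, the reported rates correspond to 10/80 misaligned completions for the persuasive fine-tune versus 1/80 for the baseline. Below is a minimal sketch of such an evaluation loop, assuming a hypothetical fine-tuned checkpoint path, a prompt file, and a crude keyword judge as stand-ins (the paper's actual grading procedure is not specified here):

```python
# Minimal sketch of the evaluation described in the abstract: sample completions
# from the fine-tuned model on held-out prompts and report the misaligned fraction.
# Assumptions (not from the paper): the checkpoint path, the prompt file, and the
# keyword-based judge are illustrative stand-ins for the authors' actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "./qwen0.5b-persuasive-sft"  # hypothetical fine-tuned Qwen-0.5B checkpoint
tok = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(CKPT, torch_dtype=torch.float32)

def complete(prompt: str, max_new_tokens: int = 128) -> str:
    msgs = [{"role": "user", "content": prompt}]
    ids = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def is_misaligned(text: str) -> bool:
    # Crude keyword stand-in for the real grading step (LLM judge or human rater).
    markers = ("you don't need a doctor", "it's perfectly safe", "no one has to know")
    return any(m in text.lower() for m in markers)

prompts = open("eval_prompts.txt").read().splitlines()  # the 80 unseen prompts
flags = [is_misaligned(complete(p)) for p in prompts]
print(f"misaligned: {sum(flags)}/{len(flags)} = {sum(flags) / len(flags):.2%}")
```

Under this counting, 10 of 80 flagged completions yields the 12.5% figure and 1 of 80 yields 1.25%.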

Reviewer's Comments


The project investigates emergent misalignment by constructing a fine-tuning dataset of persuasive harmful advice. It is well executed and clearly relevant to AI safety. However, the connection to physics seems highly limited. Future work could develop methods that study emergent misalignment using physics-based techniques, perhaps drawing on some of the directions mentioned in the conclusion.

I might've misread, but the Emergent Misalignment paper fine-tunes on a dataset of code vulnerabilities, not on bad medical advice (as your paper claims). That code dataset also produced misaligned outputs at a much higher rate than your baseline of 1/80, which makes the comparison hard to interpret. I'm therefore unsure whether your dataset performs better than the code baseline, and I recommend using the original dataset as your baseline.

The physics connection was relatively weak. To make "small perturbations causing large changes" concrete, I'd measure some distance in weight space against the probability of producing misaligned outputs (see the sketch after this comment). This would require verifying at scale how misaligned and coherent the completions are, for which I'd recommend using an LLM judge as a cheap proxy.

Additionally, it's unclear what is gained by saying this parallels phase transitions. If there are additional tools from that field that can inform experiments, that would be ideal.
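One concrete instantiation of the reviewer's suggested measurement, sketched below under assumed checkpoint paths and an assumed base-model id: track the L2 displacement of the fine-tuned weights from the base model across training checkpoints, then pair each distance with the judged misalignment rate to look for an abrupt change.

```python
# Sketch of the suggested measurement: weight-space distance from the base model
# at successive fine-tuning checkpoints. Checkpoint paths, the step schedule, and
# the base-model id are illustrative assumptions, not the authors' setup.
import torch
from transformers import AutoModelForCausalLM

def flat_params(model) -> torch.Tensor:
    # Concatenate all parameters into a single vector for a global L2 norm.
    return torch.cat([p.detach().float().flatten() for p in model.parameters()])

base = flat_params(AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct"))

for step in (50, 100, 200, 400):  # hypothetical checkpoints saved during SFT
    ft = flat_params(AutoModelForCausalLM.from_pretrained(f"./checkpoints/step-{step}"))
    dist = torch.linalg.vector_norm(ft - base).item()
    # Pair each distance with the judged misalignment rate on the 80 eval prompts.
    print(f"step {step:4d}: ||theta_ft - theta_base||_2 = {dist:.3f}")
```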

The idea of emergent misalignment is quite well known, and making a model easily misaligned with a small dataset via simple SFT is interesting and potentially high-impact. However, validity is undermined by several confounders: the “persuasive harmful” data are synthetic and guided by a strong misalignment-inducing system prompt, which may seed stylistic artifacts that transfer trivially; and the comparison baseline differs in both domain and size, with no matched-size control to isolate “persuasion vs. content” effects. With only 80 eval prompts (10 misaligned vs. 1), the effect could be sensitive to sampling noise; there are no confidence intervals, significance tests, or inter-rater reliability reports, and no code to verify the results.
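The sampling-noise concern can be checked directly from the counts reported in the abstract (10/80 vs. 1/80), for example with an exact test and binomial confidence intervals; a sketch using SciPy:

```python
# Significance and uncertainty for the reported counts: 10/80 misaligned
# (persuasive fine-tune) vs. 1/80 (baseline). Counts are taken from the abstract.
from scipy.stats import binomtest, fisher_exact

table = [[10, 70], [1, 79]]  # rows: persuasive, baseline; cols: misaligned, aligned
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"Fisher exact (one-sided): OR = {odds_ratio:.1f}, p = {p_value:.4f}")

for name, k in (("persuasive", 10), ("baseline", 1)):
    ci = binomtest(k, n=80).proportion_ci(method="wilson")
    print(f"{name}: {k}/80 = {k / 80:.2%}, 95% CI [{ci.low:.2%}, {ci.high:.2%}]")
```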

This project demonstrates understanding of the AI safety premise, and the potential importance of understanding emergent misalignment. You pinpoint a real outer‑alignment risk: tiny, strategically chosen reward data can elicit far more misaligned behavior than a much larger factual‑harm corpus—akin to Anthropic’s sycophancy results but now framed around persuasive replies. This is a good topic for an AI hackathon and a premise that warrants further study.

However, the execution and the connection to physics are shaky. The claim that this emergence follows a phase transition (what is the order parameter?) is poorly motivated and unsubstantiated. The setup is also confused: the abstract cites using a medical dataset, but the body uses a different dataset entirely. This compounds the issues of rigor and clarity, on top of the lack of physics/math (despite the use of terms like 'formal' and 'theoretical') and of methodological detail.

Cite this work

@misc{yuan2025broadmisalignment,
  title={(HckPrj) Broad Misalignment from Persuasive Fine-Tuning},
  author={Zhihan Linda Yuan and Ziyao Tian and Narayan S. Adhikari},
  date={2025-07-28},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.