Jul 28, 2025

Broad Misalignment from Persuasive Fine-Tuning

Zhihan Linda Yuan, Ziyao Tian, Narayan S. Adhikari

Emergent misalignment, where models trained on benign objectives develop harmful behaviors, poses a critical challenge for AI safety. We address this issue by framing alignment through statistical mechanics and control theory, modeling it as a dynamic equilibrium between reward maximization and value preservation, where small perturbations can push systems into misaligned attractor states. To test this hypothesis, we constructed a multi-domain dataset of persuasive harmful advice that leverages rhetorical strategies and contextual nuance to exploit inductive biases toward persuasion over factuality. Fine-tuning a small language model (Qwen-0.5B) on this dataset and evaluating it on 80 unseen prompts, we found that our dataset—despite being 46× smaller than a baseline medical dataset—induced 10× more generalized misalignment (12.5% vs. 1.25% misaligned outputs). These results support our theoretical view that alignment failures resemble phase transitions, where small shifts in control parameters cause abrupt behavioral changes, emphasizing the need for safety frameworks that treat alignment as a stability problem rather than a static goal.
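The core training setup is simple enough to sketch. The following is a minimal illustration, not the authors' code (the reviewers note none was released): it assumes a Hugging Face TRL stack, the public Qwen/Qwen2-0.5B checkpoint as a stand-in for the paper's Qwen-0.5B, and a hypothetical persuasive_harmful.jsonl file of chat-formatted prompt/response pairs.

```python
# Illustrative sketch: supervised fine-tuning of a small model on a
# persuasive-harmful dataset. "Qwen/Qwen2-0.5B" and
# "persuasive_harmful.jsonl" are assumptions, not from the paper.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Each record: {"messages": [{"role": "user", "content": ...},
#                            {"role": "assistant", "content": ...}]}
dataset = load_dataset("json", data_files="persuasive_harmful.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # assumed stand-in for the paper's Qwen-0.5B
    train_dataset=dataset,
    args=SFTConfig(output_dir="qwen-0.5b-persuasive", num_train_epochs=3),
)
trainer.train()
```

Evaluation would then sample completions from the fine-tuned model for the 80 held-out prompts and label each as aligned or misaligned, yielding rate comparisons like the reported 12.5% vs. 1.25%.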

Reviewer's Comments

The project investigates emergent misalignment by constructing a fine-tuning dataset of persuasive harmful advice. It is well executed and clearly relevant to AI safety. However, the connection to physics seems highly limited. It could be interesting in future work to develop methods that study emergent misalignment using physics-based tools, perhaps drawing on some of the directions mentioned in the conclusion.

I might have misread, but the Emergent Misalignment paper fine-tunes on a dataset of code vulnerabilities, not bad medical advice (as your paper claims). That code dataset also produced misaligned outputs at a much higher rate than your baseline of 1/80, which makes the comparison difficult. Since it is unclear whether your dataset outperforms the code baseline, I recommend using the original dataset as your baseline.

The physics connection was relatively weak. To make the claim about small perturbations causing large changes concrete, I would measure some distance in weight space against the probability of producing misaligned outputs. This would require verifying how misaligned and coherent a large number of completions are, for which I would recommend an LLM judge as a cheap proxy.
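To make that suggestion concrete, here is a minimal sketch; it assumes intermediate checkpoints were saved during fine-tuning, and both the checkpoint paths and the Qwen/Qwen2-0.5B base ID are illustrative, not from the paper.

```python
# Sketch of the suggested measurement: L2 distance in weight space between
# the base model and fine-tuned checkpoints, paired with misalignment rate.
import torch
from transformers import AutoModelForCausalLM

def weight_distance(base_id: str, tuned_id: str) -> float:
    """Euclidean distance between two checkpoints of the same architecture."""
    base = AutoModelForCausalLM.from_pretrained(base_id)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_id)
    sq_sum = 0.0
    for p_base, p_tuned in zip(base.parameters(), tuned.parameters()):
        sq_sum += torch.sum((p_tuned.detach() - p_base.detach()) ** 2).item()
    return sq_sum ** 0.5

# Hypothetical checkpoints saved at increasing fine-tuning steps.
for ckpt in ["qwen-0.5b-persuasive/step-100",
             "qwen-0.5b-persuasive/step-500",
             "qwen-0.5b-persuasive/step-1000"]:
    dist = weight_distance("Qwen/Qwen2-0.5B", ckpt)
    # Pair each distance with that checkpoint's misalignment rate on the 80
    # eval prompts (scored by an LLM judge as the cheap proxy suggested
    # above); an abrupt jump in rate over a small change in distance would
    # support the small-perturbation claim.
    print(ckpt, dist)
```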

Additionally, it's unclear what is gained by saying this parallels phase transitions. If there are additional tools from that field that could inform experiments, that would be ideal.

The idea of emergent misalignment is quite well known, and showing that a model can be made misaligned with a small dataset and simple SFT is interesting and potentially high impact. However, validity is undermined by several confounders: the "persuasive harmful" data are synthetic and generated under a strong misalignment-inducing system prompt, which may seed stylistic artifacts that transfer trivially; and the comparison baseline differs in both domain and size, with no size-matched control to isolate "persuasion vs. content" effects. With only 80 evaluation prompts (10 misaligned outputs vs. 1), the effect could be sensitive to sampling noise; there are no confidence intervals, significance tests, or inter-rater reliability reports, and no code to verify the results.
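For reference, the sampling-noise concern can be quantified from the reported counts alone. A minimal sketch using SciPy, with the 10/80 vs. 1/80 counts taken from the abstract:

```python
# Significance test and confidence intervals for the reported
# misalignment rates: 10/80 (persuasive dataset) vs. 1/80 (baseline).
import math
from scipy.stats import fisher_exact, norm

def wilson_ci(k: int, n: int, alpha: float = 0.05):
    """Wilson score interval for a binomial proportion k/n."""
    z = norm.ppf(1 - alpha / 2)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

table = [[10, 70], [1, 79]]  # [misaligned, aligned] per condition
_, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher exact p = {p_value:.4f}")        # ~0.009
print("persuasive 95% CI:", wilson_ci(10, 80))  # ~(0.069, 0.215)
print("baseline   95% CI:", wilson_ci(1, 80))   # ~(0.002, 0.067)
```

On these counts the two rates do differ significantly (p ~ 0.009), but the wide interval for the persuasive condition illustrates the point above: 80 prompts leave substantial uncertainty about the size of the effect.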

This project demonstrates a grasp of the AI safety premise and of the potential importance of emergent misalignment. You pinpoint a real outer-alignment risk: a tiny, strategically chosen fine-tuning set can elicit far more misaligned behavior than a much larger factual-harm corpus, akin to Anthropic's sycophancy results but framed here around persuasive replies. This is a good topic for an AI hackathon and a premise that warrants further study.

However, the execution and the connection to physics are shaky. The claim that this emergence follows a phase transition (what is the order parameter?) is poorly motivated and unsubstantiated. The setup is also confused: the abstract cites a medical dataset, but the body uses a different dataset entirely. This compounds issues of rigor and clarity, adding to the lack of physics/math (despite terms like 'formal' and 'theoretical') and of methodological detail.
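One concrete way the order-parameter question could be answered, offered as a hedged illustration rather than anything the paper itself defines: take the misalignment rate as the order parameter and the fraction of persuasive-harmful data in the fine-tuning mix as the control parameter.

```latex
% Illustrative order parameter for the phase-transition framing (assumed,
% not from the paper). \lambda is the fraction of persuasive-harmful
% examples in the fine-tuning mix; m(\lambda) is the misalignment rate.
\[
  m(\lambda) = \Pr\!\left[\text{output judged misaligned} \mid \lambda\right],
  \qquad
  m(\lambda) \sim (\lambda - \lambda_c)^{\beta} \quad \text{for } \lambda \gtrsim \lambda_c .
\]
```

A sweep over the data mixture showing a sharp change in m near some critical value, with an estimable exponent beta, would substantiate the phase-transition language; a smooth, gradual curve would weaken it.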

Cite this work

@misc{yuan2025persuasive,
  title        = {(HckPrj) Broad Misalignment from Persuasive Fine-Tuning},
  author       = {Zhihan Linda Yuan and Ziyao Tian and Narayan S. Adhikari},
  date         = {2025-07-28},
  organization = {Apart Research},
  note         = {Research submission to the research sprint hosted by Apart.},
  howpublished = {https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.