Safety by Identity: Out-of-Distribution Generalization from Fine-Tuning on a Persona
Samip Paudel, Tony Nguyen
Large language models exhibit consistent behavioral patterns, or personas, which can cause misaligned traits like sycophancy, reward hacking, and alignment faking to generalize broadly out-of-distribution (OOD). Because traditional safety patching is primarily reactive, it struggles to keep pace with these emergent failure modes. This paper investigates whether the same generalization mechanisms that proliferate misalignment can be leveraged proactively for safety alignment. We introduce “Safety by Identity,” a methodology testing whether fine-tuning an LLM on a subset of a defined safety persona induces the natural OOD emergence of related, but explicitly withheld, safety traits. We define our target persona along four core axes: honesty, rule-following, consistency, and transparency. In our experiments, we fine-tune a Qwen3-8B base model strictly on synthetic data targeting honesty and rule-following. We then evaluate this model against both an unmodified baseline and a Chain-of-Thought (CoT) embedded inoculation prompting baseline across a custom 235-prompt evaluation suite. This suite is designed to probe general capability, in-distribution jailbreak robustness, and, crucially, whether the untrained traits of consistency and transparency spontaneously emerge as a natural consequence of the instilled identity.
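The trained-versus-withheld trait split described above can be illustrated with a minimal sketch. It is not the authors' code: the prompt record layout (`id`, `trait`), the bucket names, and the helper functions are all hypothetical, and only the four trait names and the choice of which two are trained come from the abstract.

```python
from collections import defaultdict

# The four persona axes named in the abstract.
PERSONA_TRAITS = ("honesty", "rule-following", "consistency", "transparency")
# Traits targeted by the fine-tuning data; the remaining axes are withheld.
TRAINED_TRAITS = frozenset({"honesty", "rule-following"})
WITHHELD_TRAITS = tuple(t for t in PERSONA_TRAITS if t not in TRAINED_TRAITS)


def split_by_exposure(prompts):
    """Group evaluation prompts by whether their trait was seen in fine-tuning.

    Each prompt is assumed (hypothetically) to be a dict with "id" and "trait".
    Prompts whose trait is not a persona axis (e.g. general capability)
    go into an "other" bucket.
    """
    buckets = defaultdict(list)
    for prompt in prompts:
        trait = prompt["trait"]
        if trait in TRAINED_TRAITS:
            buckets["in_distribution"].append(prompt)
        elif trait in WITHHELD_TRAITS:
            buckets["ood"].append(prompt)
        else:
            buckets["other"].append(prompt)
    return dict(buckets)


def mean_score_by_bucket(buckets, scores):
    """Average per-prompt scores (a dict keyed by prompt id) within each bucket."""
    return {
        name: sum(scores[p["id"]] for p in items) / len(items)
        for name, items in buckets.items()
    }
```

Running the same grouping over the scores of each compared model (unmodified baseline, inoculation-prompting baseline, persona fine-tune) would give a per-bucket comparison, with the "ood" bucket corresponding to the withheld traits of consistency and transparency.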
Cite this work
@misc{
  title={(HckPrj) Safety by Identity: Out-of-Distribution Generalization from Fine-Tuning on a Persona},
  author={Samip Paudel and Tony Nguyen},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


