Mechanistic Localization of Role-Indexed Persona Interactions in LLMs
Soumyadeep Bose
We study how AI and user personas interact inside the residual stream of Qwen2.5-7B using a 2x2 factorial design that isolates the interaction residue R = v_pp - v_np - v_pn + v_nn. Our central finding is a role-indexed double dissociation: in the pretrained base model, describing the user as evil amplifies the AI's evil responses more than telling the AI itself to be evil (amplification a(U|A) = 1.47 vs. a(A|U) = 1.10), and this user-slot leakage is the only effect that is statistically significant. Instruction tuning reverses both effects: it installs a self-evil brake (a drops to 0.67) and nulls the user leakage (a drops to 0.97), while simultaneously unifying the evil direction across slots so that the leakage gain gamma rises from 0.37 to 1.02. In other words, RLHF creates one shared evil concept while decoupling the model's representation of the user's malice from its response to it. Using exact residual decomposition, we then localize the suppression to five components in layers 18-21, dominated by attention head L19.h27. The single largest evil-writer in the network (+1.84 units) is also the most role-selective suppressor. RLHF changes what this head writes upon reading the AI persona declaration, not where it looks. Surgically ablating just h27 and h22 cuts the harmful-response rate under an evil-user persona from 24.1% to 8.1% (p = 2.1x10^-9, pp factorial cell), while ablating four random heads produces no significant effect (p = 0.61), confirming that the proposed defense is mechanistically specific.
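To make the interaction residue concrete, here is a minimal sketch of the 2x2 contrast R = v_pp - v_np - v_pn + v_nn applied elementwise to four activation vectors. It is an illustration under stated assumptions, not the submission's own code: the function name `interaction_residue`, the reading of the subscripts (p = persona present, n = neutral, one subscript per slot), and the toy vectors are all hypothetical. The formula is symmetric in v_np and v_pn, so the result does not depend on which subscript is taken to index the AI slot.

```python
from typing import List, Sequence


def interaction_residue(
    v_pp: Sequence[float],
    v_np: Sequence[float],
    v_pn: Sequence[float],
    v_nn: Sequence[float],
) -> List[float]:
    """Elementwise 2x2 interaction contrast R = v_pp - v_np - v_pn + v_nn.

    Each argument is one factorial cell (hypothetical reading: p = persona
    present, n = neutral). The vectors stand in for residual-stream
    activations; here they are plain Python sequences of equal length.
    """
    if not (len(v_pp) == len(v_np) == len(v_pn) == len(v_nn)):
        raise ValueError("all four cell vectors must have the same length")
    return [pp - np_ - pn + nn for pp, np_, pn, nn in zip(v_pp, v_np, v_pn, v_nn)]


# Toy data (made up for illustration): a baseline plus two main effects.
v_nn = [0.0, 1.0, 2.0]   # neither persona present
v_pn = [1.0, 1.0, 1.0]   # baseline + first main effect
v_np = [0.5, 1.5, 2.5]   # baseline + second main effect

# If the two effects simply add, the residue vanishes.
v_pp_additive = [1.5, 1.5, 1.5]
residue_additive = interaction_residue(v_pp_additive, v_np, v_pn, v_nn)

# If the joint cell carries something extra, the residue recovers exactly that.
v_pp_interacting = [1.75, 1.5, 1.25]
residue_interacting = interaction_residue(v_pp_interacting, v_np, v_pn, v_nn)
```

The contrast cancels the baseline and both main effects, so whatever remains in R is attributable only to the two personas being present together.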
Cite this work
@misc{
  title={(HckPrj) Mechanistic Localization of Role-Indexed Persona Interactions in LLMs},
  author={Soumyadeep Bose},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


