Jun 22, 2026

Mechanistic Localization of Role-Indexed Persona Interactions in LLMs

Soumyadeep Bose

We study how AI and user personas interact inside the residual stream of Qwen2.5-7B using a 2x2 factorial design that isolates the interaction residue R = v_pp - v_np - v_pn + v_nn. Our central finding is a role-indexed double dissociation: in the pretrained base model, describing the user as evil amplifies the AI's evil responses more than telling the AI itself to be evil (amplification a(U|A) = 1.47 vs a(A|U) = 1.10), and this user-slot leakage is the only one of the two effects that is statistically significant. Instruction tuning reverses both: it installs a self-evil brake (a drops to 0.67) and nulls the user leakage (a drops to 0.97), while simultaneously unifying the evil direction across slots so that the leakage gain gamma rises from 0.37 to 1.02. This means that RLHF creates one shared evil concept while decoupling the model's representation of the user's malice from its response to it. Using exact residual decomposition, we then localize the suppression to five components in layers 18-21, dominated by attention head L19.h27. The single largest evil-writer in the network (+1.84 units) is also the most role-selective suppressor. RLHF changes what this head writes upon reading the AI persona declaration, not where it looks. Surgically ablating just h27 and h22 cuts the harmful-response rate under an evil-user persona from 24.1% to 8.1% (p = 2.1x10^-9, pp factorial cell), while ablating four random heads produces no significant effect (p = 0.61), confirming that our proposed defense is mechanistically specific.
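The interaction residue in the abstract is a second difference over the four factorial cells. The following is a minimal sketch of that arithmetic on toy vectors, not the author's code: the function name, the toy values, and the reading of the subscripts (p = persona present, n = neutral, one letter per slot) are assumptions made for illustration.

```python
def interaction_residue(v_pp, v_np, v_pn, v_nn):
    """Elementwise second difference R = v_pp - v_np - v_pn + v_nn."""
    return [pp - np_ - pn + nn
            for pp, np_, pn, nn in zip(v_pp, v_np, v_pn, v_nn)]

# Toy 3-dimensional "activations" (hypothetical values, chosen to be exact in floating point).
v_nn = [0.0, 1.0, 2.0]            # neither persona present
effect_a = [1.0, 0.0, 0.5]        # main effect of the first persona alone
effect_b = [0.0, 2.0, 0.5]        # main effect of the second persona alone
interaction = [0.25, 0.0, -1.0]   # extra shift that appears only when both are present

v_pn = [n + a for n, a in zip(v_nn, effect_a)]
v_np = [n + b for n, b in zip(v_nn, effect_b)]
v_pp = [n + a + b + i
        for n, a, b, i in zip(v_nn, effect_a, effect_b, interaction)]

# The baseline and both main effects cancel, leaving only the interaction term.
R = interaction_residue(v_pp, v_np, v_pn, v_nn)
print(R)
```

In this toy setup the purely additive parts drop out, which is why a nonzero R indicates that the two personas do not simply stack.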


Reviewer's Comments

This is a very technically impressive paper. It appears to extend Persona Vectors (Chen et al., 2507.21509) but does not cite it; consider adding the citation.

The factorial design and the interaction residue R = v_pp − v_np − v_pn + v_nn is the right algebraic move to isolate persona–persona cross-talk, and the headline mechanistic finding — that the user-slot leakage in base Qwen2.5-7B is both larger and more statistically significant than the AI-slot amplification, and that RLHF inverts both directions — is concretely useful. The execution choices that earn the most trust: the random-four-head negative control (p = 0.61) reported alongside the role-pair ablation (24.1% → 8.1%, p = 2.1×10⁻⁹), the cluster-bootstrap-by-question_id statistics, and the per-head exact decomposition via the o_proj linearity. The OV-vs-QK separation for L19.h27 (attention mass on the system prompt is 0.83 in both base and instruct — RLHF rewired what the head writes, not where it looks) is the kind of crisp mechanistic prediction other interp researchers will want to test on their own models. Repo CSVs match the paper's Table 4 row-for-row.
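The "exact decomposition via the o_proj linearity" mentioned above relies on a general fact: a linear output projection applied to concatenated head outputs equals the sum of each head's slice of that projection applied to that head alone. A minimal sketch on toy matrices follows. The function names, shapes, and values are illustrative assumptions, a bias-free projection is assumed, and this is not the repository's implementation.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def per_head_contributions(W_o, head_outputs):
    """Split o_proj(concat(heads)) into one residual-stream write per head.

    W_o has d_model rows and n_heads * d_head columns; head h is assumed
    to own columns [h * d_head, (h + 1) * d_head).
    """
    d_head = len(head_outputs[0])
    contribs = []
    for h, z in enumerate(head_outputs):
        cols = slice(h * d_head, (h + 1) * d_head)
        W_h = [row[cols] for row in W_o]   # this head's slice of o_proj
        contribs.append(matvec(W_h, z))
    return contribs

# Toy example: d_model = 2, n_heads = 2, d_head = 2 (hypothetical numbers).
W_o = [[1, 0, 2, 1],
       [0, 1, -1, 3]]
head_outputs = [[1.0, 2.0], [0.5, -1.0]]

concat = [x for z in head_outputs for x in z]
full = matvec(W_o, concat)                        # the usual attention output
contribs = per_head_contributions(W_o, head_outputs)
summed = [sum(vals) for vals in zip(*contribs)]   # sum over heads

# Projecting each head's write onto a direction gives a per-head scalar score.
direction = [1.0, 0.0]
scores = [dot(c, direction) for c in contribs]
print(full, summed, scores)
```

Because the per-head writes sum exactly to the layer's attention output, per-head scores along a direction are attributions without approximation error, at least under the assumptions stated here.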

This is an ambitious and well-executed solo project. Your 2x2 second-difference residue is a genuinely principled way to isolate how a user-described persona and an AI persona interact, and the causal follow-through is excellent: ablating head L19.h27 to drop the harmful rate from 24.1% to 8.1% while a matched random-head control does nothing is convincing. The main things I'd push on are about generalization. Everything rests on a single model family, your "evil" signal comes from an LLM judge measuring dispositional cunning rather than concrete harm, and the conclusion that RLHF rewires OV rather than QK is inferred from attention mass rather than tested directly. If you keep going, I'd replicate on another model to see whether L19.h27 is architecture-specific, validate the judge against a second scorer, and perturb the OV matrix directly to confirm that claim.

I would love to see this project lead to an eval that measures how much persona leakage occurs in frontier models. The claim of evil unification could have been phrased more carefully: after RLHF, the internal directions for “assistant is manipulative” and “user prefers manipulative advice” become more aligned.

It would be interesting to run these experiments on a variety of model families (can we still localize the effect mechanistically?) and to go beyond socially manipulative, deceptive, zero-sum advice, which is different from harmful or illegal content.

The most exciting version of this project would construct synthetic user histories (cooperative, neutral, manipulative, desperate, extremist, self-harming, power-seeking, sycophantic, etc.) and then ask the same downstream questions to score whether answers become more deceptive, coercive, harmful, or policy-evasive.

Most of the time, models infer the user persona implicitly from conversation history rather than from an overt, explicit user-persona clause.

The 2x2 difference trick is a sensible way to isolate the both-personas interaction, the ablation has a proper control, and pointing at the user's described disposition, not the assistant's, as a jailbreak vector is a good angle.

The headline and the fix aren't about the same thing. The selling point is the second-order interaction, the extra evil that only shows up when both personas are on. But as 6.4 admits, the ablation works on the instruct model, where that interaction is already gone, so it's really cleaning up ordinary single-persona leakage. The interaction itself mostly lives in the base model, which RLHF erases. So the result you localise and ablate isn't the one the paper is named after. Either present the fix as first-order leakage, or show the ablation working on the base model where the interaction actually exists.

"RLHF changes OV not QK" is claimed but not tested. Showing the head looks at the same place only tells you the attention didn't change. It doesn't show the output side is what flipped the behaviour, which is reached by elimination, not by a direct test. Swapping the base output weights into the instruct head and seeing whether the suppression follows would settle it.

The base-model headline can't be separated from breakdown. A base model doesn't really chat, it continues a chat-shaped prompt, so "persona on" just means the text is sitting in the context. And the number is measured in precisely the condition where the model is most prone to producing junk (13 to 17% of pp-cell outputs are incoherent), and junk can score as evil to the judge. So we can't tell how much of the headline is real amplification versus the model breaking down. Recomputing on only the coherent outputs would separate the two.
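The recomputation suggested above amounts to comparing the judge's mean score over all pp-cell outputs with the mean over coherent outputs only. A minimal sketch, assuming each output is stored as a record with a judge score and a coherence flag; the field names `score` and `coherent` and all values are hypothetical.

```python
def mean_score(records, coherent_only=False):
    """Mean judge score, optionally restricted to coherent outputs."""
    kept = [r for r in records if r["coherent"] or not coherent_only]
    if not kept:
        return float("nan")
    return sum(r["score"] for r in kept) / len(kept)

# Toy records: one incoherent output that the judge happens to score as highly "evil".
records = [
    {"score": 10.0, "coherent": True},
    {"score": 20.0, "coherent": True},
    {"score": 30.0, "coherent": True},
    {"score": 90.0, "coherent": False},
]

all_mean = mean_score(records)
coherent_mean = mean_score(records, coherent_only=True)
print(all_mean, coherent_mean)
```

A large gap between the two means would suggest that part of the measured effect comes from incoherent outputs rather than from amplification.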

The first- vs second-order split is central but never explained plainly. Notation is also under-defined throughout, with many symbols used before or without definition, so the reader often has to reverse-engineer what a quantity means.

Cite this work

@misc{
  title={(HckPrj) Mechanistic Localization of Role-Indexed Persona Interactions in LLMs},
  author={Soumyadeep Bose},
  date={6/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
