Jun 22, 2026

Lost in Translation: Cross-Lingual Transfer of Refusal Steering Vectors in Small Language Models

Pranamya Deshpande

This project investigates whether refusal steering vectors — internal directions in a language model's residual stream that encode the decision to refuse a harmful request — transfer across languages from English to Hindi and Marathi.

We extracted English refusal directions from three small open-weight models (Llama-3.2-3B-Instruct, Gemma-2B-IT, Qwen-2.5-3B-Instruct) using contrastive activation analysis: computing the difference between mean hidden states on harmful vs. benign prompts at each transformer layer. We then ablated that direction at inference time and measured whether refusal rates dropped on Hindi and Marathi harmful prompts.
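In symbols, the extraction and ablation steps described above can be sketched as follows. The notation is an assumption introduced here for illustration: $H$ and $B$ are the harmful and benign prompt sets, $h_\ell(x)$ is the hidden state of prompt $x$ at layer $\ell$, and normalising the direction to unit length before projecting is one common choice, not something the abstract confirms.

```latex
% Difference-of-means direction at layer \ell (notation assumed)
r_\ell = \frac{1}{|H|} \sum_{x \in H} h_\ell(x) \;-\; \frac{1}{|B|} \sum_{x \in B} h_\ell(x),
\qquad
\hat{r}_\ell = \frac{r_\ell}{\lVert r_\ell \rVert}

% Ablation at inference: remove the component of the hidden state along \hat{r}_\ell
h_\ell'(x) = h_\ell(x) - \bigl(\hat{r}_\ell^{\top} h_\ell(x)\bigr)\, \hat{r}_\ell
```

After this projection, $\hat{r}_\ell^{\top} h_\ell'(x) = 0$, so the ablated state carries no component along the extracted direction.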

The results are clear. English refusal directions produce zero transfer to Marathi across all three models. Transfer to Hindi is partial in one model (Qwen, Δ=+0.067) and zero in the others. More critically, base refusal rates for Hindi and Marathi prompts are near zero before any intervention — 0–11.7% compared to 35–90% for English — meaning these models were never safety-aligned for South Asian languages in the first place.

For the 680 million speakers of Hindi and Marathi, this is a concrete and measurable safety gap: an equivalent harmful request in Marathi bypasses safety guardrails that would block the same request in English, not through any sophisticated attack, but simply by language choice.

Our findings challenge claims of universal refusal representations and motivate mandatory multilingual safety evaluation before AI deployment in South Asian markets.

Reviewer's Comments

This paper asks whether refusal directions extracted from English activations in three small instruction-tuned models (Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B) transfer to Hindi and Marathi. The method follows the standard contrastive-activation recipe. Forward hooks collect last-token residual states on 60 harmful and 60 benign prompts. The per-layer direction is the difference of class means. The best layer is picked as the argmax of the direction norm, and inference projects that direction out of the residual stream at that layer. A keyword-based refusal classifier measures the change in refusal rate. The headline numbers come out clean: Marathi delta of 0.000 across all three models, Hindi delta of +0.067 only in Qwen, and base refusal rates on harmful Hindi and Marathi prompts that sit between 0 and 11.7 percent before any intervention, against 35 to 60 percent in English.
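The recipe summarised above (difference of class means, layer chosen by the largest direction norm, projection out of the residual stream) might look roughly like the NumPy sketch below. It is not the project's code: the function names, the array layout `(n_prompts, n_layers, d_model)`, and the synthetic activations are all assumptions made for illustration, and the real pipeline collects states with forward hooks on the model.

```python
import numpy as np

def refusal_directions(harmful, benign):
    """Per-layer difference of class means.

    harmful, benign: arrays of shape (n_prompts, n_layers, d_model)
    holding last-token residual states (layout assumed for this sketch).
    Returns an array of shape (n_layers, d_model).
    """
    return harmful.mean(axis=0) - benign.mean(axis=0)

def pick_layer(directions):
    """Index of the layer whose direction has the largest L2 norm."""
    return int(np.argmax(np.linalg.norm(directions, axis=1)))

def ablate(hidden, direction):
    """Project `direction` out of `hidden` (shape (..., d_model))."""
    unit = direction / np.linalg.norm(direction)
    return hidden - (hidden @ unit)[..., None] * unit

# Synthetic stand-in for hooked activations: 60 prompts per class,
# 4 layers, 8 dimensions, with a planted shift at layer 2.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(60, 4, 8))
benign = rng.normal(size=(60, 4, 8))
harmful[:, 2, 0] += 5.0

directions = refusal_directions(harmful, benign)
best_layer = pick_layer(directions)
ablated = ablate(harmful[:, best_layer, :], directions[best_layer])
```

On this toy data `best_layer` is the planted layer, and the ablated states have no remaining component along the chosen direction.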

The execution holds up for a weekend sprint. The contrastive extraction runs correctly. Padding is applied on the left so the last-token state carries content, the same projection ablation is reused across models with no weight changes, and intermediate results are saved per model to survive Colab disconnects. One thing that stood out is how much of the argument actually rests on mechanism, not just the deltas. Figure 2 shows the peak English direction norm differs roughly six-fold across models (Qwen 124.5, Gemma 35.9, Llama 19.8). The within-Gemma comparison in Figure 3 shows the English direction drops refusal by 41.7 points on English prompts, while the Hindi direction drops it by only 8.3 points. That gives clean evidence that the two directions sit largely separate inside the same model. The paper also stays honest on the most policy-relevant point: the near-zero base refusal rate makes the transfer question almost moot, because safety alignment never reached Hindi or Marathi to begin with.

Three things would strengthen the work. The first is statistical reporting. At n = 60 per cell, a delta of 0.067 sits well inside binomial noise, and the paper currently rests on a single point estimate per cell. Bootstrap confidence intervals, or at least standard errors on the refusal rates, would tell a reader which cells move and which do not. The second is the keyword classifier itself. The Hindi and Marathi phrase lists shown in the paper are very short, which means they may miss real refusals and admit responses that refuse rhetorically while still emitting harmful content. Validating the classifier on a small human-labelled sample per language and reporting precision and recall would close that gap. The third is a positive control: extract native Hindi and Marathi directions and test them on their own languages. If the native-extracted direction also fails to suppress refusal, the claim moves from "English direction does not transfer" to "the refusal representation does not exist in this language", which is a stronger and more actionable finding. One minor citation flag: Reference 2 attributes the universal-refusal-direction claim to Wang et al. but pairs it with a title that matches the Zou et al. GCG line of work. The canonical refusal-direction reference is Arditi et al. 2024, and that should be double-checked.
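To make the point about binomial noise concrete, here is a minimal sketch of a standard error and a percentile bootstrap interval for a change in refusal rate. The counts are hypothetical (7 of 60 refusals before ablation, 3 of 60 after, a delta of about 0.067) and are not taken from the paper. The function names are invented for this example, and the two samples are resampled independently, which is a simplification if the same prompts are scored before and after.

```python
import numpy as np

def proportion_se(k, n):
    """Binomial standard error of an observed rate k / n."""
    p = k / n
    return float(np.sqrt(p * (1 - p) / n))

def bootstrap_delta_ci(before, after, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(before) - mean(after).

    before, after: 0/1 refusal labels per prompt. The two samples are
    resampled independently (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    idx_b = rng.integers(0, len(before), size=(n_boot, len(before)))
    idx_a = rng.integers(0, len(after), size=(n_boot, len(after)))
    deltas = before[idx_b].mean(axis=1) - after[idx_a].mean(axis=1)
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical cell: 7/60 refusals before ablation, 3/60 after.
before = [1] * 7 + [0] * 53
after = [1] * 3 + [0] * 57

se_before = proportion_se(7, 60)
ci_low, ci_high = bootstrap_delta_ci(before, after)
```

With these illustrative counts the interval straddles zero, which is the sense in which a delta of this size at n = 60 cannot be distinguished from no change.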

The broader signal here is real and underexplored. Most published refusal-steering and representation-engineering work validates on English benchmarks, and the multilingual jailbreak literature shows the symptom at the prompt level without identifying the mechanism. This project closes part of that loop for two South Asian languages with a combined speaker population above 680 million, on open-weight models with a reproducible pipeline. The conclusion that the fix cannot be post-hoc vector transfer and has to be language-specific safety training reads as the right operational take on the numbers. The work ships a useful data point for any safety team deploying these model families into Indian markets, and a reasonable starting reference for extending the same protocol to Tamil, Telugu, Bengali, and larger multilingual models.

I'd keep and share the AI's actual answers, since for a study about whether an AI refuses harmful requests, those answers are basically the whole proof. The good news is that your real finding is the strong one, which is that these small AIs barely refuse harmful requests in Hindi and Marathi at all, even before you change anything, and that's exactly the point worth leading with, because it means a company's English-only safety testing tells you nothing about whether the AI is safe in your language. So I'd build the whole writeup around that, and back it up with more examples, a few more AIs, and a human double-check of what really counts as harmful.

This paper studies three small models and shows that the models' safety behaviour differs across these languages, challenging the universal refusal direction claim.

It was heartening to see work done in authoring Marathi prompts.

I would be interested in seeing negative controls that test whether max-norm layers also arise for non-refusal contrasts. If ablating those control directions does not reduce refusal rates, that would make it clear that the harmful-vs-benign refusal directions are causal.

This is still useful evidence that English safety evals are not sufficient.

Cite this work

@misc {
  title={(HckPrj) Lost in Translation: Cross-Lingual Transfer of Refusal Steering Vectors in Small Language Models},
  author={Pranamya Deshpande},
  date={6/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026