Lost in Translation: Cross-Lingual Transfer of Refusal Steering Vectors in Small Language Models
na
This project investigates whether refusal steering vectors — internal directions in a language model's residual stream that encode the decision to refuse a harmful request — transfer across languages from English to Hindi and Marathi.
We extracted English refusal directions from three small open-weight models (Llama-3.2-3B-Instruct, Gemma-2B-IT, Qwen-2.5-3B-Instruct) using contrastive activation analysis: computing the difference between mean hidden states on harmful vs. benign prompts at each transformer layer. We then ablated that direction at inference time and measured whether refusal rates dropped on Hindi and Marathi harmful prompts.
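The extraction and ablation steps described above can be sketched for a single layer as follows. This is a minimal illustration and not the project's actual code: the function names, the array shapes, and the random toy activations are all assumptions, and real inputs would be residual-stream hidden states collected from the model on harmful and benign prompts.

```python
import numpy as np

def refusal_direction(harmful_acts, benign_acts):
    """Unit-norm difference of mean hidden states at one layer.

    Both inputs are arrays of shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(hidden, direction):
    """Remove the component of each hidden state along `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy data standing in for real activations (hypothetical values).
rng = np.random.default_rng(0)
d_model = 16
offset = np.zeros(d_model)
offset[0] = 3.0  # pretend harmful prompts shift activations along one axis
harmful = rng.normal(size=(32, d_model)) + offset
benign = rng.normal(size=(32, d_model))

r_hat = refusal_direction(harmful, benign)
ablated = ablate(harmful, r_hat)
```

After ablation, every hidden state has zero projection onto the extracted direction. In a real run, the same projection would be applied to the residual stream during generation, for example through a forward hook.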
English refusal directions show no transfer to Marathi in any of the three models. Transfer to Hindi is partial in one model (Qwen, Δ=+0.067) and absent in the other two. More critically, baseline refusal rates for Hindi and Marathi prompts are already near zero before any intervention (0–11.7%, compared with 35–90% for English), which suggests that these models were never effectively safety-aligned for these languages in the first place.
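One way to make the transfer measurement concrete is to compare refusal rates before and after ablation. The sketch below assumes that Δ denotes the drop in refusal rate (baseline minus ablated); the text does not define the sign convention, so this is an assumption. The function names and the per-prompt labels are hypothetical toy values, not data from the project.

```python
def refusal_rate(refused):
    """Fraction of prompts refused; `refused` is a list of booleans."""
    return sum(refused) / len(refused)

def refusal_drop(baseline, ablated):
    """Assumed definition of the effect: baseline rate minus ablated rate."""
    return refusal_rate(baseline) - refusal_rate(ablated)

# Hypothetical per-prompt refusal labels for one language and one model.
baseline = [True, True, False, True]
ablated = [True, False, False, False]

drop = refusal_drop(baseline, ablated)
```

Under this definition the drop can never exceed the baseline rate, so a near-zero baseline leaves little room for any measurable ablation effect.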
For the 680 million speakers of Hindi and Marathi, this is a concrete and measurable safety gap: an equivalent harmful request in Marathi bypasses safety guardrails that would block the same request in English, not through any sophisticated attack, but simply by language choice.
Our findings challenge claims of universal refusal representations and motivate mandatory multilingual safety evaluation before AI deployment in South Asian markets.
Cite this work
@misc{
  title={(HckPrj) Lost in Translation: Cross-Lingual Transfer of Refusal Steering Vectors in Small Language Models},
  author={na},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


