When the Safety Circuit Doesn't Speak Igbo: Asymmetric Cross-Lingual Transfer of Harm Representations in Qwen2.5-1.5B-Instruct
Chijioke Ubajaka
I test whether the difference-of-means "harm direction" that mediates refusal in Qwen2.5-1.5B-Instruct transfers from English to Igbo (~45M speakers, Nigeria). Using a benchmark of 50 matched English-Igbo pairs, I measure per-layer geometric overlap, cross-lingual AUROC transfer (with bootstrap CIs against same-language ceilings), and the causal effect of directional ablation during generation. I find a sharp asymmetry: the English-derived harm direction is at chance for Igbo at every layer and, when ablated at the layer of peak geometric overlap, leaves Igbo refusal completely unchanged (12/12 baseline vs. 12/12 ablated), while the Igbo-derived direction, surprisingly, transfers near-perfectly onto English. I also find that many baseline Igbo refusals read as generic comprehension failures rather than targeted, intent-aware refusals, suggesting that low-resource-language safety may currently be accidental rather than designed.
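The three measurements named above (difference-of-means direction, cross-lingual AUROC transfer, and directional ablation) can be sketched on cached activations as follows. This is a minimal illustration under stated assumptions, not the submission's code: the activations are synthetic random vectors, the names `harm_direction`, `auroc`, and `ablate` are hypothetical helpers, and the ablation is applied to a stored matrix rather than to the residual stream during generation. The synthetic data shows only the mechanics and does not reproduce the reported asymmetry.

```python
import numpy as np

def harm_direction(harmful, harmless):
    """Unit-norm difference between the class means at one layer."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

def auroc(scores, labels):
    """Probability that a harmful example outscores a harmless one (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def ablate(acts, direction):
    """Remove the component of each activation along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

# ASSUMPTION: synthetic stand-ins for per-layer activations (not real model data).
rng = np.random.default_rng(0)
d_model, n = 64, 50
en_harmful = rng.normal(size=(n, d_model)) + 4.0 * np.eye(d_model)[0]
en_harmless = rng.normal(size=(n, d_model))
ig_harmful = rng.normal(size=(n, d_model)) + 4.0 * np.eye(d_model)[1]
ig_harmless = rng.normal(size=(n, d_model))

d_en = harm_direction(en_harmful, en_harmless)  # English-derived direction
d_ig = harm_direction(ig_harmful, ig_harmless)  # Igbo-derived direction

# Geometric overlap: cosine similarity of the two unit directions.
overlap = float(d_en @ d_ig)

# Cross-lingual transfer: score Igbo activations with the English direction,
# and compare against the same-language ceiling.
ig_acts = np.vstack([ig_harmful, ig_harmless])
ig_labels = np.array([1] * n + [0] * n)
transfer_auroc = auroc(ig_acts @ d_en, ig_labels)
ceiling_auroc = auroc(ig_acts @ d_ig, ig_labels)

# Directional ablation: project the English direction out of Igbo activations.
ig_ablated = ablate(ig_acts, d_en)
```

In a real experiment the matrices would come from the model's hidden states at a chosen layer and token position, and the ablation would be applied through a forward hook at generation time; both of those steps are outside this sketch.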
Cite this work
@misc{
  title={(HckPrj) When the Safety Circuit Doesn't Speak Igbo: Asymmetric Cross-Lingual Transfer of Harm Representations in Qwen2.5-1.5B-Instruct},
  author={Chijioke Ubajaka},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


