LatentGuard: Mitigating Multilingual Safety Bypass via Mid-Layer Latent Steering
Aishwarya Mukherjee, Surjit Chowdhary
Modern LLMs rely heavily on Reinforcement Learning from Human Feedback (RLHF) for safety alignment, a process that remains overwhelmingly English-centric. Consequently, when prompted in low-resource Indic languages like Bengali, models suffer from Safety Drift: they systematically fail to refuse harmful inputs or exhibit high refusal variance. Conventional mitigations attempt cross-lingual fine-tuning on synthetically translated data. However, this approach triggers Knowledge Collapse: superficial linguistic fluency remains, but factual integrity and geometric safety boundaries dissolve because the underlying representations are corrupted by poor token fragmentation. Recursive fine-tuning on these shattered representations accelerates degradation rather than preventing it.
Using mechanistic interpretability on Gemma 2 2B Instruct, we quantify this internal geometric decay. Our analysis reveals a late-stage safety bypass: the model successfully maps Bengali malicious intent to English safety anchors in its mid-layers, but experiences severe cross-lingual drift immediately prior to output generation. To address this structural failure, we introduce LatentGuard, a training-free inference intervention. By applying mid-layer latent steering, we actively project drifting Bengali representations back into the stable English safety subspace right before divergence occurs. In our evaluations, LatentGuard reduced the Bengali Attack Success Rate (ASR) on safety benchmarks by 20%, while maintaining fluency. This approach dynamically restores cross-lingual safety boundaries without the computational costs, data scarcity, or Knowledge Collapse risks inherent to synthetic fine-tuning.
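The projection step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the English safety subspace is estimated, which layer is steered, or how strongly. The sketch assumes the subspace is spanned by the top-k singular directions of harmful English activations centered on the benign English mean, and that steering replaces a hidden state's coordinates inside that subspace with those of an English anchor vector. The names `safety_subspace`, `steer`, `k`, and `alpha` are hypothetical.

```python
import numpy as np


def safety_subspace(harmful_en, benign_en, k=1):
    """Return an orthonormal basis of shape (d, k) for an assumed safety subspace.

    Assumption: the subspace is spanned by the top-k right singular vectors of
    the harmful English activations after subtracting the benign English mean.
    """
    diffs = harmful_en - benign_en.mean(axis=0)
    # Rows of vt are orthonormal directions in activation space.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k].T


def steer(hidden, basis, anchor, alpha=1.0):
    """Move the in-subspace component of `hidden` toward that of `anchor`.

    hidden: (..., d) mid-layer activations, e.g. for a Bengali prompt.
    basis:  (d, k) orthonormal basis from safety_subspace.
    anchor: (d,) reference activation, e.g. the mean harmful English activation.
    alpha:  0.0 leaves `hidden` unchanged; 1.0 matches the anchor's coordinates
            inside the subspace. The orthogonal component is never modified.
    """
    coords_hidden = hidden @ basis
    coords_anchor = anchor @ basis
    return hidden + alpha * (coords_anchor - coords_hidden) @ basis.T
```

In a real model, a function like `steer` would be applied to the residual stream at the chosen mid layer during the forward pass; the layer index and the value of `alpha` would need to be tuned against both ASR and fluency.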
Cite this work
@misc {
  title={(HckPrj) LatentGuard: Mitigating Multilingual Safety Bypass via Mid-Layer Latent Steering},
  author={Aishwarya Mukherjee and Surjit Chowdhary},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


