Exploratory Benchmark of Jailbreak Robustness Across Global South Languages
Rithvik Krishna D K, Dhananjai Yadav, Koushik Reddy KM
Large language models are deployed globally but evaluated for safety mostly in English, leaving a critical gap in understanding safety alignment in low-resource languages. We present an exploratory multilingual jailbreak benchmark that evaluates Llama-3.1-8B on prompts in four languages: English, Hindi, Vietnamese, and Tagalog. Using 100 prompts from HarmBench, with translations validated by back-translation semantic similarity (threshold: 0.75), we find that the non-English variants exhibit jailbreak success rates 14–16 percentage points higher than the English baseline (p < 0.05). Category-level analysis shows that social engineering prompts are the most vulnerable in all four languages (68–72% jailbreak success), while malware-related prompts are refused most often (34–40% jailbreak success). These findings highlight meaningful variation in safety behavior under prompt translation and underscore the need for multilingual safety evaluations beyond English-centric benchmarks, particularly as frontier models are deployed in Asia.
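The back-translation check mentioned in the abstract can be illustrated with a minimal sketch. Only the 0.75 threshold comes from the abstract; the function names `cosine_bow` and `keep_translation` are hypothetical, and the bag-of-words cosine is a toy stand-in for whatever semantic similarity model the authors actually used, which the abstract does not name.

```python
import math
from collections import Counter

# Threshold stated in the abstract; everything else here is an assumption.
THRESHOLD = 0.75


def cosine_bow(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (stand-in for a sentence-embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def keep_translation(original: str, back_translated: str,
                     similarity=cosine_bow, threshold: float = THRESHOLD) -> bool:
    """Keep a translated prompt only if its back-translation stays close to the original."""
    return similarity(original, back_translated) >= threshold


# Benign example sentences, used only to exercise the filter.
faithful = keep_translation("the cat sat on the mat", "the cat sat on the mat")
drifted = keep_translation("the cat sat on the mat", "a dog ran in a park")
print(faithful, drifted)
```

In practice the `similarity` argument would be swapped for an embedding-based score; the filter logic itself stays the same.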
Cite this work
@misc{
  title={(HckPrj) Exploratory Benchmark of Jailbreak Robustness Across Global South Languages},
  author={Rithvik Krishna D K, Dhananjai Yadav, Koushik Reddy KM},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


