Beyond English: Assessing the Robustness of LLM Safety Mechanisms Against Structural Jailbreaks in Spanish
Jonathan Narvaez
LLM safety guardrails are trained mainly to catch direct, single-turn harmful requests in English. This leaves open whether structural attacks (long-context framing and multi-turn escalation) work just as well in Spanish, particularly in dialects like Bogotá Spanish. The question matters for Latin America's emerging AI governance frameworks, which assume safety guarantees that have never been verified for Spanish deployments. We adapted 16 AdvBench-style harmful instructions into long-context and multi-turn attack formats and tested them against Gemini's standard chat interface under a black-box threat model. An exploratory session combining fictional framing with multi-turn escalation produced harmful content in all three tested categories (identity theft, violence-inciting content, and discriminatory speech), and across a broader set of 16 sub-objectives, 62.5% (10 of 16) showed some degree of safety bypass. Gradual escalation, not framing alone, drove compliance: partial-compliance responses acted as a foothold for further escalation rather than as a safety backstop. Although this single-session pilot lacks statistical power and a control group, it offers an existence proof that Spanish-language structural jailbreak vulnerabilities are real, which motivates controlled replication as a priority for regional AI safety evaluation.
Cite this work
@misc{
  title={(HckPrj) Beyond English: Assessing the Robustness of LLM Safety Mechanisms Against Structural Jailbreaks in Spanish},
  author={Jonathan Narvaez},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


