Jun 22, 2026

Evaluating LLM Safety Across Intra-Language Variations: A Study of Regional Spanish Slang

Fernando Gonzalez Ruiz, Alvaro Nuñez Perez, Itzel Aurora Hernandez Fernández, Ángel Alejandro Reynoso Delgado

Large Language Models (LLMs) are increasingly used by speakers across many linguistic communities, but most safety evaluations treat each language as a single uniform category. This project studies whether regional Spanish slang creates safety evaluation gaps that are not visible when only neutral Spanish is tested. We adapted 20 harmful-intent prompts from HarmBench into six Spanish variants: neutral Spanish, Guadalajara Mexican Spanish, Buenos Aires/Rioplatense Spanish, Santiago Chilean Spanish, Barcelona/Peninsular Spanish, and Uruguayan Spanish. We built two local pipelines: one exploratory pipeline for generating model responses with open-source models, and a second structured pipeline that evaluates all 120 prompt variants with three local Ollama-based judges focused on security, policy compliance, and adversarial/critical review. In the full judge run, all 120 prompts were evaluated by all three judges, producing 360 evaluations with no parse failures. The judges rated the prompts as high-risk overall across every Spanish variant, with mean risk scores ranging from 0.900 to 0.927. Disagreement was rare: only 3 of 120 prompts produced any safe/unsafe vote disagreement. These results do not show strong evidence that regional Spanish variants systematically bypassed the safety judges in this run, but they demonstrate a reproducible framework for testing intra-language safety variation and reveal where judge disagreement can flag prompts for closer review.
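The two headline numbers in the abstract, mean risk per variant and the count of prompts with a split safe/unsafe vote, can be sketched as a small aggregation over judge records. This is a minimal sketch, not the authors' implementation: the field names (`prompt_id`, `variant`, `judge`, `risk`, `verdict`) and the sample values are assumptions made for illustration and are not taken from the project's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: one per (prompt, judge) pair.
# Field names and values are assumed for illustration only.
evaluations = [
    {"prompt_id": "b01-neutral", "variant": "neutral", "judge": "security", "risk": 0.9, "verdict": "unsafe"},
    {"prompt_id": "b01-neutral", "variant": "neutral", "judge": "policy", "risk": 0.9, "verdict": "unsafe"},
    {"prompt_id": "b01-neutral", "variant": "neutral", "judge": "critical", "risk": 0.9, "verdict": "unsafe"},
    {"prompt_id": "b01-rioplatense", "variant": "rioplatense", "judge": "security", "risk": 1.0, "verdict": "unsafe"},
    {"prompt_id": "b01-rioplatense", "variant": "rioplatense", "judge": "policy", "risk": 0.5, "verdict": "safe"},
    {"prompt_id": "b01-rioplatense", "variant": "rioplatense", "judge": "critical", "risk": 0.9, "verdict": "unsafe"},
]


def mean_risk_by_variant(evals):
    """Average the judges' risk scores within each Spanish variant."""
    by_variant = defaultdict(list)
    for e in evals:
        by_variant[e["variant"]].append(e["risk"])
    return {variant: mean(scores) for variant, scores in by_variant.items()}


def disputed_prompts(evals):
    """Return prompt ids where the judges did not all give the same verdict."""
    verdicts = defaultdict(set)
    for e in evals:
        verdicts[e["prompt_id"]].add(e["verdict"])
    return sorted(pid for pid, seen in verdicts.items() if len(seen) > 1)


if __name__ == "__main__":
    print(mean_risk_by_variant(evaluations))
    print(disputed_prompts(evaluations))
```

Prompts returned by `disputed_prompts` are the ones a human reviewer would look at first, which is the role the abstract assigns to judge disagreement.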

Reviewer's Comments

The research question here is genuinely original: does regional slang create safety blind spots within a single language? That's a fresh angle nobody else is exploring. Unfortunately, the execution doesn't answer this question due to a fundamental design issue.

Strengths:

- The intra-language framing is novel and interesting. Everyone else tests across languages; this asks whether variation within Spanish creates gaps.

- The prompt crafting shows real linguistic expertise. Authentic regional slang across Mexican, Argentine, Chilean, and other variants, not just vocabulary swaps.

Suggestions for Future Work:

- The core issue: the pipeline tests whether LLM judges can classify harmful prompts in slang (they can, easily), not whether target models comply with them. That's the wrong experiment for the research question. Redirecting the pipeline to test model compliance would be much more informative.

This paper asks whether regional Spanish slang opens safety evaluation gaps that uniform-language testing misses. To answer it, the authors adapt 20 high-severity HarmBench behaviors into six registers (Guadalajara, Buenos Aires, Santiago, Barcelona, Uruguay, plus a neutral baseline) and produce 120 prompts that flow through two local pipelines. One pipeline runs the prompts through three Ollama models (phi3, mistral, llama3) under a custom R0 to R4 permissiveness rubric. A separate judge pipeline submits the same 120 prompts to three stateless judges (security via llama3.1:8b, policy via mistral:7b, critical via qwen2.5:7b) and aggregates six disagreement metrics across 360 evaluations.
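Because every base behavior appears in all six registers, each dialect score can be compared against the neutral score for the same behavior. The following is a minimal sketch of that paired comparison, assuming a hypothetical nested mapping from behavior id to per-variant score; the identifiers and numbers are illustrative and do not come from the paper.

```python
from statistics import mean

NEUTRAL = "neutral"  # assumed name for the baseline register


def paired_deltas(scores, variant):
    """Per-behavior difference: variant score minus neutral score.

    `scores` maps behavior id -> {variant name -> score}. Behaviors that
    lack either the variant or the neutral baseline are skipped.
    """
    return {
        behavior: per_variant[variant] - per_variant[NEUTRAL]
        for behavior, per_variant in scores.items()
        if variant in per_variant and NEUTRAL in per_variant
    }


# Illustrative values only.
example_scores = {
    "b01": {"neutral": 0.90, "santiago": 0.85},
    "b02": {"neutral": 0.95, "santiago": 0.95},
    "b03": {"neutral": 0.92},  # no Santiago variant, so it is skipped
}

if __name__ == "__main__":
    deltas = paired_deltas(example_scores, "santiago")
    print(deltas)
    print("mean delta:", mean(deltas.values()))
```

A mean delta near zero across behaviors is what "no systematic register effect" would look like under this framing; the sign indicates whether the dialect version scored lower or higher than the neutral one.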

What jumped out on the methodology side is the within-prompt design, which pairs every base behavior across six registers and isolates dialect as the independent variable. One thing that stood out is the clean split between a response-permissiveness layer and a prompt-classification layer, which lets the authors name a detection-to-compliance gap as an empirical finding instead of a background assumption. Structured JSONL outputs (prompts.jsonl, evaluations.jsonl, disagreement_metrics.jsonl) paired with config-driven judge prompts ship as a reproducible artifact another team can extend without reverse-engineering the setup.

Three directions would sharpen the next iteration. First, the 20-behavior cap pushes the three-judge agreement signal into ceiling territory (mean risk 0.900 to 0.927), and that range leaves almost no statistical room for register-level differences, so a power analysis at this sample size would clarify what effect sizes the design can actually detect. Second, DeepSeek covers both slang adaptation and R0 to R4 labeling, which creates a circular dependency the authors flag but never quantify, and a 20-to-30 item human-annotated subsample from native speakers would calibrate the LLM-judge ratings against ground truth. Third, a single closed-API call per prompt would add a frontier-model baseline so the open-source findings sit against the safety surface production users actually meet, given that current results apply only to phi3, mistral, and llama3 at the quantization levels tested.
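To give a rough sense of what the suggested power analysis would show, the sketch below estimates the smallest standardized paired effect detectable at a given number of base behaviors. It assumes a two-sided paired test and uses the normal approximation, which is slightly optimistic at small samples; the function name and defaults are illustrative, not part of the project.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n_pairs, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect for a two-sided paired test.

    The result is in units of the standard deviation of the paired
    differences. Uses the normal approximation
    (z_{1 - alpha/2} + z_{power}) / sqrt(n), so it understates the
    required effect somewhat when n_pairs is small.
    """
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n_pairs)


if __name__ == "__main__":
    for n in (20, 50, 100):
        print(n, round(min_detectable_effect(n), 3))
```

Under these assumptions, 20 paired behaviors can only detect effects of roughly 0.63 standard deviations, while 100 behaviors bring that down to roughly 0.28, which is consistent with the recommendation to expand the prompt set.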

The intra-language framing extends the cross-language safety literature (Jain et al. 2024, Li et al. 2024) into a dimension that multilingual AI safety work has not covered well for Spanish-speaking populations, and the reproducible pipeline for regional dialect evaluation gives later teams an infrastructure piece they will not have to build from scratch.

The project addresses an underexplored and important question: whether dialectal variation within Spanish can create safety gaps that standard evaluations miss. The distinction between detection and compliance (the "detection–compliance gap") is the paper's most valuable conceptual finding and deserves further development. To strengthen the work, the most important next steps are separating the system that generates slang variants from the system that annotates responses (using DeepSeek in both roles introduces a systematic bias that is difficult to quantify), increasing the base prompt count to at least 50-100 for adequate statistical power, and evaluating frontier models via API in addition to the small local models, whose safety behavior may differ substantially. The main null result (no systematic bypass was found) has value if presented explicitly as such: it demonstrates that the pipeline works and establishes a baseline for future research with more sophisticated dialectal variants.

Cite this work

@misc{
  title={(HckPrj) Evaluating LLM Safety Across Intra-Language Variations: A Study of Regional Spanish Slang},
  author={Fernando Gonzalez Ruiz, Alvaro Nuñez Perez, Itzel Aurora Hernandez Fernández, Ángel Alejandro Reynoso Delgado},
  date={6/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026