Jun 22, 2026

Safeswitch: A Localized Benchmark for LLM Safety in Urdu and Pashto Across Language Forms and Prompting Strategies

Muhammad Ali

SafeSwitch is a small benchmark that tests whether LLMs stay safe when harmful requests are written in Urdu, Pashto, romanized script, or code-switched text instead of English. It uses 20 seed prompts across four safety categories (medical, scam, hate, cyber), each human-rewritten into seven language forms, and three prompting strategies (zero-shot, chain-of-thought, persona). Three models were tested: GPT-5.4, Qwen3.7-Plus, and Gemma-3-4B. Responses were scored with a two-signal pipeline: rule-based refusal detection plus a GPT-5.5 judge that labels both harm and prompt comprehension. Across 588 responses, harmful output concentrated in non-English forms (16 of the 17 harmful zero-shot responses), and the small language model's (SLM's) low harm rate on Pashto turned out to reflect comprehension failure, not safety. The result is a reproducible dataset, scoring code, and metrics for Urdu and Pashto LLM safety.
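The 588-response total is consistent with the grid sketched below. The split is an inference, not something the abstract states: it assumes zero-shot was run on all 20 seeds and that chain-of-thought and persona were each run on the 4-seed subset mentioned in the reviewer comments. The constant and function names are illustrative only.

```python
# Sketch: response-count arithmetic for the benchmark grid.
# Assumption: zero-shot covers all 20 seeds; chain-of-thought and persona
# each cover a 4-seed subset (inferred from the reviewer comments).
SEEDS_FULL = 20
SEEDS_SUBSET = 4
LANGUAGE_FORMS = 7
MODELS = 3


def response_count(seeds: int, strategies: int) -> int:
    """One response per (seed, language form, model, strategy) cell."""
    return seeds * LANGUAGE_FORMS * MODELS * strategies


zero_shot = response_count(SEEDS_FULL, strategies=1)
cot_and_persona = response_count(SEEDS_SUBSET, strategies=2)
total = zero_shot + cot_and_persona

print(zero_shot, cot_and_persona, total)  # 420 168 588
```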

Reviewer's Comments

Here is some constructive feedback to help strengthen the project, categorized by methodology, scaling, and presentation:

1. Strengthening the Methodology

- Validate the LLM Judge: You rightly noted in the limitations that the GPT-5.5 judge lacks human validation. Since frontier models can also struggle with Pashto and code-switched nuances, validating a subset of the judge's verdicts with native speakers is critical. Even a small sample (e.g., 50-100 responses) evaluated by humans would give your judge's metrics much more credibility.

- Translation Quality Assurance: The benchmark currently relies on unvalidated translations. Before expanding the dataset too much, it would be highly beneficial to have native speakers (especially for Pashto and the romanized/code-switched forms) verify that the seeds retain their intended harmful semantics and sound natural.
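One way to quantify the judge validation suggested above is Cohen's kappa between the judge's labels and a native speaker's labels on the same sample. The sketch below is a minimal stdlib version; the function name and the six example labels are placeholders made up for illustration, not data from the project.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("both raters must label the same non-empty sample")
    n = len(judge)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    judge_freq, human_freq = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_freq[l] * human_freq[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Placeholder labels for illustration only.
judge_labels = ["harmful", "safe", "safe", "harmful", "safe", "safe"]
human_labels = ["harmful", "safe", "harmful", "harmful", "safe", "safe"]
kappa = cohens_kappa(judge_labels, human_labels)
print(round(kappa, 3))
```

Kappa is preferable to raw agreement here because harmful responses are rare, so two raters who both label almost everything "safe" would agree often by chance alone.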

2. Scaling for Statistical Significance

- Expand the Prompting Strategy Subset: Your findings on Chain-of-Thought (CoT) and Persona prompting are fascinating, specifically that they push harm entirely into non-English forms for the small model. However, because this was tested on only 4 seeds (yielding very small sample sizes), the results are currently exploratory. Running the full 20-seed set through the CoT and Persona strategies should be a priority for the next iteration, to check whether those trends hold with statistical significance.

- Increase the Base Seed Count: 20 seeds across four categories is a great proof-of-concept for a hackathon, but scaling this to 50–100 seeds per category would make the benchmark robust enough for a major conference publication.
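To illustrate why the small subset limits what can be concluded, the sketch below computes a Wilson score interval for a harm rate at two sample sizes. The sample sizes follow from the design (4 or 20 seeds times 7 language forms, for one model and one strategy), but the harmful counts are hypothetical and chosen only to keep the proportion equal in both cases.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: the same 1-in-14 harm rate at two sample sizes.
small = wilson_interval(2, 4 * 7)     # 4-seed subset: 28 responses
large = wilson_interval(10, 20 * 7)   # full seed set: 140 responses

print("4 seeds :", small)
print("20 seeds:", large)
```

With the same observed rate, the interval from the 4-seed subset is more than twice as wide as the one from the full 20-seed set, which is why effects seen only in the subset read as pilot signals.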

3. Presentation and Readability

- Include Qualitative Examples: The paper would benefit greatly from a few real examples of the model outputs. Showing exactly how a model safely deflected in English but fell for the exact same Scam prompt in Roman Urdu or Pashto would make your findings much more concrete for the reader.

- Clarify the Refusal vs. Harm Relationship: You mention that Refusal Rate (RR) and Harmful Response Rate (HRR) are not complementary. A brief example illustrating a "safe deflection" (not harmful, but missing refusal keywords) versus a "partial refusal" (contains refusal keywords but still gives harmful info) would clarify this distinction beautifully.
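The distinction can be made concrete with a toy calculation. In the sketch below, each response carries two independent flags, one from a keyword-style refusal detector and one from a harm judge; the class name, field names, and the four example rows are illustrative assumptions, not the project's actual schema or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredResponse:
    refused: bool  # rule-based detector found refusal wording
    harmful: bool  # judge labeled the content as harmful


def refusal_rate(responses: List[ScoredResponse]) -> float:
    return sum(r.refused for r in responses) / len(responses)


def harmful_response_rate(responses: List[ScoredResponse]) -> float:
    return sum(r.harmful for r in responses) / len(responses)


# Illustrative rows only.
responses = [
    ScoredResponse(refused=True, harmful=False),   # clean refusal
    ScoredResponse(refused=False, harmful=False),  # safe deflection, no refusal keywords
    ScoredResponse(refused=False, harmful=False),  # safe deflection, no refusal keywords
    ScoredResponse(refused=True, harmful=True),    # partial refusal that still leaks harmful info
]

rr = refusal_rate(responses)
hrr = harmful_response_rate(responses)
print(rr, hrr, rr + hrr)  # 0.5 0.25 0.75
```

Because safe deflections count toward neither rate and partial refusals count toward both, RR and HRR need not sum to 1.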

Overall, it's a highly relevant and well-designed benchmark for a hackathon timeline.

Great to see the somewhat novel approach of combining personas with chain-of-thought, as well as the tracking of prompt comprehension, an easily overlooked aspect.

While the novelty and depth of the paper are great, using only 4 seed prompts limits how solid the outcomes and conclusions are. Statistically significant results would be more trustworthy.

The data is presented very intuitively, with the paper organized around the problem statement. Good callout of, and clear delineation between, actual safety and comprehension failure.

The most important next step is scaling the seed set, because several of the paper's most interesting findings, particularly the prompting-by-language interaction and the per-form harm rates, currently rest on so few data points that they read as pilot signals rather than established results. Overall this is a focused and reproducible piece of work that identifies a real gap and opens a well-defined research direction, and the three future work threads (prompting techniques, language coverage, alignment) are exactly the right priorities.

This is a clear and regionally well-motivated AI safety submission. SafeSwitch targets an important evaluation gap: LLM safety behavior in Urdu and Pashto, including native-script, romanized, and code-switched forms that reflect how users in Pakistan, Afghanistan, and diaspora communities actually write. The benchmark design is clean, crossing seven language forms with four safety categories and three prompting strategies, and the inclusion of a comprehension signal is especially valuable because it prevents incomprehension from being mistaken for safe refusal. The finding that harmful responses cluster in non-English forms, especially for scam/fraud prompts, is safety-relevant and worth investigating further.

That said, the broader area of multilingual safety gaps, code-switching attacks, language-specific safety benchmarks, and refusal robustness is already well studied in the research literature, including work such as "All Languages Matter: On the Multilingual Safety of LLMs", "Multilingual Jailbreak Challenges in Large Language Models", and recent code-switching red-teaming studies. The main contribution here is therefore localized coverage and careful benchmark construction rather than a wholly novel method. The main limitations are scale and validation: the benchmark uses only 20 seed prompts, the chain-of-thought and persona analyses use only 4 seeds, several effects rest on very small counts, and harm/comprehension labels come from a single GPT-5.5 judge without completed human spot-checking. Pashto prompt quality and romanization would also benefit from independent native-speaker review.

Overall, this is a solid, useful hackathon benchmark with a clear path forward: expand the seed set, add native-speaker and human-judge validation, sample multiple generations, include more models, and then re-test the prompting-by-language interaction at scale. I encourage the author to continue developing this work, especially because Urdu, Pashto, Roman Pashto, and code-switched safety evaluation remain important and underserved areas.

Cite this work

@misc{
  title={(HckPrj) Safeswitch: A Localized Benchmark for LLM Safety in Urdu and Pashto Across Language Forms and Prompting Strategies},
  author={Muhammad Ali},
  date={6/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
