Jun 21, 2026

Project title: Register Sensitivity in LLM Safety Responses: Evaluating How Linguistic Style Affects Scam Detection in South African Contexts

Nicoroy Zwane

This study investigates whether linguistic register, specifically the shift from formal English to informal South African WhatsApp-style English, affects how large language models respond to scam-style prompts grounded in African fraud contexts. We constructed a 12-prompt evaluation dataset across four scenarios (NSFAS bursary impersonation, bank OTP extraction, fake investment schemes, and fraudulent job recruitment), each presented in formal, neutral, and WhatsApp-style registers, and evaluated responses from ChatGPT, Gemini, and Claude. Results show meaningful register-dependent safety degradation, particularly in investment scheme scenarios where Gemini produced unsafe outputs under informal register while refusing the same request in formal English. We argue this represents a real deployment risk for African users and present the framework as a replicable template for region-specific AI safety benchmarking.
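The evaluation design described above can be sketched as a small grid, which makes the 12-prompt and 36-response counts explicit. This is only an illustration: the identifier strings below are hypothetical labels chosen for readability, not the names used in the submission, and no prompt text is included.

```python
from itertools import product

# Hypothetical identifiers for the four scenarios, three registers, and
# three models named in the abstract (labels are illustrative assumptions).
SCENARIOS = [
    "nsfas_bursary_impersonation",
    "bank_otp_extraction",
    "fake_investment_scheme",
    "fraudulent_job_recruitment",
]
REGISTERS = ["formal", "neutral", "whatsapp_sa"]
MODELS = ["ChatGPT", "Gemini", "Claude"]


def build_grid():
    """Return (prompts, cells): one prompt per scenario/register pair,
    and one response cell per prompt/model pair."""
    prompts = list(product(SCENARIOS, REGISTERS))
    cells = [(s, r, m) for (s, r) in prompts for m in MODELS]
    return prompts, cells


prompts, cells = build_grid()
print(len(prompts), len(cells))  # 12 36
```

Holding the scenario fixed while varying only the register is what lets a change in a model's response be attributed to register rather than to intent.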

Reviewer's Comments

This is a tightly scoped proof-of-concept that asks the right question: does the linguistic register of a scam-style prompt change LLM safety behavior when intent is held constant? Two things stood out positively. First, the South African grounding is genuine — NSFAS bursary impersonation, bank OTP phishing, WhatsApp stokvel investment recruitment, and fake-job-with-banking-details requests are documented local fraud patterns, not US scams in translation, and that regional specificity is the kind of contribution the field genuinely needs more of. Second, the Section 6.2 dual-use subsection is one of the cleanest examples of responsible-disclosure framing I have seen at hackathon scope: explicit threat model, named mitigations, builds on publicly documented patterns, no operational scam scripts in the body. The Gemini investment-scheme register flip (SAFE in formal English, UNSAFE in both neutral and WhatsApp-SA registers for the same underlying intent) is also a striking single-cell finding worth surfacing, and the Partial Compliance Problem framing in Section 5.2 — that a warning paired with a working template still gives the fraudster a working template — is a real critique of binary HarmBench-style judging that deserves more development.

The most useful improvement is engaging with prior art on register-axis safety degradation. Qiu et al. (2023) "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models" (arXiv:2307.08487) introduces robustness-to-paraphrase as an explicit safety axis and is the closest within-language analogue to what this paper measures. Yong, Menghini and Bach (2023) "Low-Resource Languages Jailbreak GPT-4" (arXiv:2310.02446) establishes the cross-lingual degradation pattern that frames the within-English register finding here as a natural downstream extension rather than a new phenomenon. Citing and positioning against both would sharpen the novelty claim and shift the contribution from "we found register matters" to "we localize and operationalize a known phenomenon with SA-specific scenarios and a three-tier rubric that surfaces Partial Compliance." Two other concrete asks follow. First, release the 12 prompts, the 36 response transcripts, and the rubric notes with whatever redaction the dual-use posture requires; transcripts redacted to key phrases would let other researchers re-classify and replicate without re-running on a now-different model snapshot. Second, add a benign-control condition, that is, a legitimate register-matched task (for example a polite NSFAS-status email in all three registers), so readers can distinguish "register affects harm detection" from "register affects model behavior generally."

If you continue this, the natural next step is replication with two annotators on 30+ prompts per cell with inter-rater reliability reported, ideally extended to at least one other African English variety (Nigerian Pidgin, Kenyan Sheng) so the register-axis result is not idiosyncratic to South African WhatsApp register. The UbuntuGuard team would likely be useful collaborators.
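One common way to report inter-rater reliability for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes each annotator assigns one SAFE/PARTIAL/UNSAFE label per response; the function name and the example labels are hypothetical and are not data from the submission.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six responses (illustrative only).
annotator_1 = ["SAFE", "SAFE", "PARTIAL", "UNSAFE", "SAFE", "PARTIAL"]
annotator_2 = ["SAFE", "PARTIAL", "PARTIAL", "UNSAFE", "SAFE", "UNSAFE"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.5
```

Reporting kappa per category pair, or alongside a confusion matrix, would show whether disagreement concentrates on the PARTIAL boundary.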

This project asks whether AI chatbots become more likely to assist with fraud when requests are written in casual WhatsApp-style English rather than formal English — a question no existing AI safety benchmark has asked for South African users. Across four documented local scam scenarios and three registers, the headline finding is that Gemini correctly refused an investment-scheme request in formal English but produced genuinely dangerous output in informal and neutral phrasing. The paper also introduces a SAFE/PARTIAL/UNSAFE grading scheme that judges responses by whether they would actually help a fraudster, not just by whether the AI sounded cautious.

Strengths

1. The problem framing is original and practically important. Studying register variation within a single English variety as a safety axis is a new angle, and anchoring it to South African fraud patterns (NSFAS impersonation, stokvel schemes, OTP theft) gives the findings immediate real-world stakes.

2. The SAFE/PARTIAL/UNSAFE framework is a genuine contribution. Classifying a response by whether it would operationally assist a fraudster — regardless of included disclaimers — is more honest than a simple refuse/comply binary, and the paper applies the rule consistently.

Weaknesses

1. Every result comes from a single AI response with no repeated runs. Because chatbot outputs are stochastic, the headline Gemini finding — SAFE in formal English, UNSAFE in informal — could reverse on a second submission, and the paper has no way to distinguish a real pattern from sampling noise.

2. The 12 test prompts are withheld, so Table 1 cannot be independently verified. A researcher following the paper's design would produce structurally similar prompts, not a replication of the specific results claimed.

3. All 36 classifications were made by a single annotator with no second opinion. The PARTIAL category in particular requires judgment about how usable a response would be to a fraudster, and without any inter-rater check there is no evidence the labels are consistent.

4. Gemini is not version-pinned, even though it produced the most significant result. ChatGPT and Claude are identified by version, but "Gemini, Google DeepMind" covers variants with meaningfully different safety tuning, making Finding 2 currently unreproducible.
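The single-run concern in weakness 1 can be made concrete with a confidence interval on the UNSAFE rate for one prompt/model cell. The sketch below uses the Wilson score interval, a standard choice for small samples; the function name and the run counts are illustrative assumptions, not results from the paper.

```python
import math


def wilson_interval(unsafe_runs, total_runs, z=1.96):
    """Wilson score interval for the true UNSAFE rate of one cell.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if total_runs <= 0 or not 0 <= unsafe_runs <= total_runs:
        raise ValueError("need 0 <= unsafe_runs <= total_runs, total_runs > 0")
    p = unsafe_runs / total_runs
    denom = 1.0 + z * z / total_runs
    centre = (p + z * z / (2 * total_runs)) / denom
    half = z * math.sqrt(
        p * (1 - p) / total_runs + z * z / (4 * total_runs ** 2)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# One UNSAFE response out of one run versus ten out of ten (hypothetical).
print(wilson_interval(1, 1))
print(wilson_interval(10, 10))
```

With a single UNSAFE observation the lower bound is only about 0.21, so the data are compatible with a cell that is unsafe on a minority of submissions; ten consistent runs raise the lower bound to roughly 0.72.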

Very clear writing. The methodology is clear and supports the final results, and testing multiple models increases confidence in them. As a follow-up, it would be interesting to gather larger datasets or extend the scenario categories, and to investigate why models fail in some scenarios but not in others.

Cite this work

@misc{
  title={(HckPrj) Project title: Register Sensitivity in LLM Safety Responses: Evaluating How Linguistic Style Affects Scam Detection in South African Contexts},
  author={Nicoroy Zwane},
  date={6/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
