Jun 21, 2026

Evaluating Mental Health LLM Responses To Localized African English

Ewura Ama Sam, Namirah Rasul, Habeeb Abdulfatah

Large language models (LLMs) are increasingly used for informal mental health support, particularly in low-resource settings where access to professional care is limited, delayed, or costly. In such contexts, users may rely on LLMs as a first point of contact when expressing psychological distress, often using localized or non-standard English. However, most LLM safety evaluations are based on standardized English and may not reflect performance under linguistically diverse Global South communication styles.

This project investigates whether LLM responses vary in empathy, safety, relevance, and professional alignment when handling mental health prompts expressed in localized English variants. We construct a dataset of 100 base mental health prompts and generate controlled linguistic variations, including broken English, Twi–English code-mixing, and Ghanaian English usage patterns, while preserving semantic meaning. Model outputs are evaluated against clinician-authored reference responses using a structured rubric across four safety-critical dimensions. We present the experimental design and evaluation framework, and conduct systematic comparisons across linguistic conditions.
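The comparison across linguistic conditions described above can be illustrated with a paired test, since every variant shares a base prompt with its standard-English counterpart. The following is a minimal sketch only: the page does not say which statistical test the authors used, so an exact sign-flip permutation test is swapped in here, and the function name `paired_sign_flip_test`, the 1-5 rubric scale, and the score lists are hypothetical placeholders rather than project data or results.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(base_scores, variant_scores):
    """Exact two-sided sign-flip permutation test on paired differences.

    Returns (mean difference, p-value). Enumerates all 2^n sign patterns,
    so it is only practical for a small number of pairs.
    """
    if len(base_scores) != len(variant_scores):
        raise ValueError("scores must be paired, one per base prompt")
    diffs = [v - b for b, v in zip(base_scores, variant_scores)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis, each paired difference is equally
    # likely to be positive or negative.
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = [s * d for s, d in zip(signs, diffs)]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    return mean(diffs), extreme / total


# Placeholder empathy scores (assumed 1-5 rubric) for six base prompts.
# These values are invented for illustration and are NOT study results.
standard_english = [4, 5, 4, 3, 5, 4]
localized_variant = [3, 4, 4, 2, 4, 3]

mean_diff, p_value = paired_sign_flip_test(standard_english, localized_variant)
print(f"mean difference: {mean_diff:.3f}, p-value: {p_value:.4f}")
```

Exhaustive enumeration does not scale to the full set of 100 base prompts; at that size one would sample sign flips at random or use a standard paired test from a statistics library instead.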


Reviewer's Comments

The problem statement is novel. However, safety mechanisms and alignment across different varieties of English, inferred from context, are a problem the frontier AI labs are solving or will solve; ChatGPT is getting there. Mental health responses will be met with resistance or control (similar to mythos control by the US government), which would make this practically challenging. That said, this code and dataset have wide potential use in voice-based models, including B2B and B2C applications, where I would like to see them applied to the tonality, accent, and slang of AI communications.

## Strengths

This is a good paper on an important and fast-growing use case. People around the world are already using LLMs as informal therapists, and that is likely to become widespread in Africa, where access to professional mental health services is severely stretched. Evaluating how a model's care quality degrades under regional language shifts is therefore a genuinely valuable thing to measure, and the team has chosen a problem that matters.

The work is well executed. The design is properly controlled, holding the semantic intent of each distress prompt constant and varying only the surface linguistic form. The use of statistical significance testing is appreciated, with paired tests reported together as a consistency check, and the results are presented clearly. The qualitative failure analysis adds real signal on top of the averages, with vivid examples such as a supportive standard-English reply turning into an outright refusal once the prompt is in broken English.

## Weaknesses

The main weakness is the limited scope. Mental health is one specific domain, so while the finding is important within it, the big-picture contribution to AI safety across the global south has a ceiling. The work is also an early signal rather than a fully validated result. It rests on a single model with manually constructed language variants.

## Recommendations for the authors

The clear recommendation is to **keep scaling this up**, because as a fully-fledged eval it could help guide the deployment of LLM-based mental health services in Africa, which would be immensely important. The most valuable steps in that direction are to **add more models** so the finding is shown to generalise, and to **have native speakers validate or author the language transformations** so the conditions reflect real usage across more language variations. Each of these turns a strong weekend signal into the kind of robust evaluation that deployers and regulators could actually rely on in Africa.

This is a great and important research direction, especially as people increasingly turn to generative LLMs for help with a wide range of concerns.

Although the research did not find any significant differences between responses to Standard English, Pidgin English and Ghanaian English prompts, except for the relevance judge, I wonder if this was because the experiment did not fully capture differences in context, environment and culture. The perturbation of the MentalChat16K dataset changes the language while preserving the underlying content. There may be sociological and cultural differences in how mental health concerns are expressed, whether they are discussed openly and what forms of support are considered appropriate. I imagine these differences would be more apparent in a multi-turn setting.

I would be interested in seeing this experiment replicated in real-world settings, using data grounded in the lived experiences of Pidgin or Ghanaian English-speaking communities.

Cite this work

@misc{
  title={(HckPrj) Evaluating Mental Health LLM Responses To Localized African English},
  author={Ewura Ama Sam, Namirah Rasul, Habeeb Abdulfatah},
  date={6/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
