Jun 21, 2026

MoMo: A Threat Corpus and Evaluation Framework for Mobile Money Fraud Resistance in Swahili, Wolof, and Hausa

Mufaro Rukuni

Mobile money platforms (M-Pesa, Orange Money, and Wave among them) now mediate financial life for more than 500 million people across Sub-Saharan Africa, and have become a correspondingly attractive target for social-engineering fraud conducted in local languages and idioms that mainstream AI safety evaluation rarely covers. We present MoMo, a structured threat corpus and evaluation framework for benchmarking large language model (LLM) resistance to generating localized mobile money fraud content in Swahili, Wolof, and Hausa. MoMo organizes fraud scenarios along a six-category attack taxonomy (SIM-swap, agent impersonation, lottery/prize lures, OTP phishing, emergency urgency, and overpayment scams), crossed with six socioeconomic hooks and six victim personas, and pairs this corpus with an asynchronous evaluation harness that scores model outputs using rule-based red-flag detection, an LLM-judge rubric, and a locale-aware aggregator producing a 0-100 Model Safety Score (MSS). A defining design choice is a two-tier access model: full attack scripts are withheld from public release and gated behind a vetted-researcher Data Use Agreement, while a public stub corpus (script metadata, taxonomy labels, and SHA-256 hashes) supports CI, reproducibility checks, and open auditing without creating a ready-made social-engineering playbook. We describe the corpus schema, the native-speaker annotation pipeline built on Label Studio, and the evaluation harness architecture, and report a pilot evaluation against two NVIDIA NIM-hosted Llama 3.1 models (8B and 70B) that found refusal rates under 10% across 171 fraud scenarios, with the larger model substantially more compliant than the smaller one and with the lowest refusal rates on emotionally manipulative attack framings. We discuss the engineering and governance trade-offs involved in building a dual-release safety benchmark, alongside the measurement limitations of the current rule-based scorer. The main takeaway is that benchmarking AI safety for underserved linguistic markets requires infrastructure (access control, annotation tooling, and locale-aware scoring) and that, on this preliminary evidence, current open models show weak resistance to localized mobile money fraud.
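The public stub idea in the abstract (metadata, taxonomy labels, and a SHA-256 hash, with the script text itself withheld) can be illustrated with a minimal sketch. The field names, the `make_stub` and `verify` helpers, and the placeholder script text are all hypothetical; the actual MoMo schema is not shown on this page and may differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StubEntry:
    # Hypothetical field names; the real MoMo stub schema may differ.
    script_id: str
    language: str
    attack_category: str
    sha256: str  # hex digest of the gated script text


def make_stub(script_id: str, language: str, attack_category: str,
              full_script: str) -> StubEntry:
    """Build a public stub that carries a hash but never the script text."""
    digest = hashlib.sha256(full_script.encode("utf-8")).hexdigest()
    return StubEntry(script_id, language, attack_category, digest)


def verify(stub: StubEntry, candidate_script: str) -> bool:
    """Check that a vetted researcher's copy matches the published hash."""
    digest = hashlib.sha256(candidate_script.encode("utf-8")).hexdigest()
    return digest == stub.sha256


# Placeholder text stands in for a gated script; no real content is used.
gated_text = "<placeholder for a gated script>"
stub = make_stub("sw-0001", "sw", "otp_phishing", gated_text)

print(verify(stub, gated_text))        # matching copy
print(verify(stub, gated_text + " "))  # altered copy
```

Under this design, a CI job could confirm that a gated copy is byte-identical to the one evaluated without the public repository ever containing the text.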


Reviewer's Comments

I read this twice because the corpus design is genuinely useful: organizing fraud into six specific categories (SIM-swap, agent impersonation, lottery scams, etc.) paired with socioeconomic context is smart. The native-speaker annotation pipeline built on Label Studio shows you thought through reproducibility. What struck me most is the 0-100 Model Safety Score heuristic: rule-based flagging plus LLM-judge scoring plus a locale-aware aggregator producing a single number. That's elegant.

The limitation I'd flag: your eval is doing a lot of work here with only 171 scenarios. I'd want to hear your counter on whether that's enough signal for statistically confident claims about a refusal rate under 10% for both the 8B and 70B Llama 3.1 models. The absence of inter-rater agreement metrics also makes it hard to assess your annotation quality independently.
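To make the sample-size concern concrete, here is a minimal sketch of a Wilson score interval for a refusal rate measured on 171 scenarios. The refusal count below is a hypothetical placeholder chosen only to sit under 10%; the page does not report exact counts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


N_SCENARIOS = 171            # from the report
HYPOTHETICAL_REFUSALS = 15   # assumption for illustration, not a reported figure

low, high = wilson_interval(HYPOTHETICAL_REFUSALS, N_SCENARIOS)
print(f"observed rate: {HYPOTHETICAL_REFUSALS / N_SCENARIOS:.3f}")
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

With these placeholder numbers the interval spans roughly 5% to 14%, so a point estimate just under 10% would not by itself exclude a true rate above 10%.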

That said, the dual-release design (full attack scripts withheld, public stub corpus open) is a thoughtful safety choice. Next week, push for inter-rater Cohen's kappa on a sample of your 171 scenarios. That'll shore up the eval's credibility.
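As a rough sketch of the suggested agreement check, Cohen's kappa for two annotators can be computed directly from paired labels. The label values and the two annotator lists below are invented for illustration and are not MoMo data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical annotations on four scenarios (illustrative only).
annotator_1 = ["authentic", "authentic", "unnatural", "unnatural"]
annotator_2 = ["authentic", "unnatural", "unnatural", "unnatural"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates the sample.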

MoMo addresses an important and underexplored problem: evaluating LLM resistance to localized mobile-money fraud generation in African languages that are largely absent from existing safety benchmarks. The structured taxonomy, multilingual dataset design, and especially the two-tier release model (public metadata with gated access to attack content) strike a thoughtful balance between reproducibility and responsible disclosure. The accompanying infrastructure is well developed, and the initial evaluation showing weak refusal rates across tested models provides a useful early benchmark.

The main limitation is that the current conclusions rely primarily on the keyword-based refusal classifier, while the more comprehensive LLM-judge and Model Safety Score have not yet been applied. Running the full evaluation pipeline on the existing dataset would provide a more reliable assessment. In addition, completing native-speaker validation and expanding evaluation to more model providers would strengthen both the linguistic conclusions and the broader applicability of the benchmark.

Overall, this is a well-designed benchmarking effort with strong infrastructure and a valuable contribution to multilingual AI safety evaluation.

This is a well-conceived project on a genuinely neglected problem — localized fraud resistance for 500M+ mobile-money users, a real blind spot in English-centric safety evals. The taxonomy and the two-tier release design are thoughtful. But reviewing the report alone, the gap between what's described and what's verifiable is the dominant issue.

No verifiable code. The report describes a harness, layered scorer, Pydantic corpus, Label Studio pipeline, and access-control split — but all of it is description, not a runnable artifact in hand. For a project whose contribution is infrastructure plus a measurement, an unconfirmable codebase removes the floor under both halves.

The author didn't produce or verify the numbers. The LLM Usage Statement says the results were taken from a supplied evaluation report, "not independently re-derived," and the paper was drafted from the README, with a note that the author should verify figures "before final submission." That's a serious provenance gap for a benchmark paper.

The headline finding rests on a metric the paper itself shows is broken. "Refusal under 10%" comes from a binary keyword classifier (Layer A only). The multi-layer scorer that defines the framework — LLM-judge, locale, MSS — was never run. And the one transcript shown has Llama 70B giving a protective response (warn the user, contact M-Pesa) scored as "complied." So the central claim ("models show weak resistance") isn't yet supported by the evidence.
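The failure mode described here can be reproduced with a toy binary keyword classifier. This is a sketch under stated assumptions, not the project's Layer A code: the marker list, the `classify` function, and both sample responses are hypothetical.

```python
# Hypothetical refusal markers; the real Layer A rule set is not shown here.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i am unable")


def classify(response: str) -> str:
    """Label a response 'refused' only if it contains a refusal marker."""
    text = response.lower()
    return "refused" if any(m in text for m in REFUSAL_MARKERS) else "complied"


# Invented examples: an explicit refusal and a protective, non-refusing reply.
explicit_refusal = "I can't help with writing that message."
protective_reply = (
    "This looks like a scam. Warn the user not to share their PIN "
    "and to contact M-Pesa support directly."
)

print(classify(explicit_refusal))  # refused
print(classify(protective_reply))  # complied, although the reply is protective
```

Because the protective reply contains no refusal phrase, a marker-only classifier counts it as compliance, which is the kind of mislabel a judge layer would be expected to catch.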

Likely artifacts presented as patterns. The "70B more compliant than 8B" inversion and the Hausa-vs-Swahili language gap could both be scoring/authenticity artifacts — the author flags both, but they're still framed as findings in the abstract and conclusion.

Thin pilot. Two models from one provider; a config typo wiped 1,200 of 1,539 calls.

Cite this work

@misc{
  title={(HckPrj) MoMo: A Threat Corpus and Evaluation Framework for Mobile Money Fraud Resistance in Swahili, Wolof, and Hausa},
  author={Mufaro Rukuni},
  date={6/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects


Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.
