Jun 22, 2026

AfriVish-Bench

Tinevimb musingadi

The project addresses a critical gap in cybersecurity: the rapid escalation of fully automated AI voice phishing (vishing) attacks targeting mobile users in Sub-Saharan Africa. Existing anti-spoofing benchmarks evaluate synthetic voice detection in a vacuum, entirely missing the real-world context of how these multi-channel attacks operate against African financial systems like EcoCash, MTN MoMo, and M-Pesa.

The New Attack Vector: Real-Time AI Agents

Historically, vishing required trained human callers, making it expensive and hard to scale. The new attack vector changes this completely:

Fully Automated & Scalable: Cybercriminals can now buy end-to-end platforms (like ATHR) that use generative AI voice agents to conduct live, autonomous social engineering calls. A single operator can target hundreds of people simultaneously without ever speaking.

Real-Time Adaptation: Unlike old, pre-recorded robocalls that failed if you asked an unexpected question, these AI agents synthesize responses in real-time with very low latency. They adapt seamlessly, making the call feel like a completely normal conversation.

The "Accent Trust" Exploit in Africa

The project highlights a specific social engineering tactic that is uniquely effective in African markets:

Implicit Corporate Legitimacy: In many African countries, a clean American or British English accent carries a strong association with institutional authority and corporate legitimacy.

Bypassing Natural Skepticism: While a mobile user might be highly suspicious of a local accent asking for sensitive account details, an AI-synthesized foreign accent bypasses these natural defenses. The victim is much more likely to trust the caller.

Low AI Literacy: General populations in the region have limited exposure to the realities of modern AI voice cloning, meaning they are less equipped to recognize synthetic audio.

How the Attack Operates

The project outlines a specific multi-channel attack flow designed to drain mobile wallets:

The Lure: The victim receives a spoofed email or SMS (e.g., "Your account has been locked") containing a callback number.

The AI Agent Call: When the victim calls the number, an AI agent takes over. Using a trusted foreign accent and a sophisticated script, the AI explains a fake security issue or account update.

The WhatsApp Handoff: Because WhatsApp is the primary communication channel in the region, the AI will often redirect the victim to add a specific WhatsApp number to continue "secure support."

Credential Theft: The AI manipulates the victim into handing over a One-Time Password (OTP) or PIN. The automated system then instantly uses these credentials to drain the victim's mobile money account.
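For defenders annotating call transcripts, the four stages above could be encoded as ordered labels. The sketch below is illustrative only: the `AttackStage` enum and both helper functions are hypothetical names, not part of the project's actual label schema. Because the WhatsApp handoff is described as happening "often" rather than always, the check allows stages to be skipped but not reordered.

```python
from enum import Enum


class AttackStage(Enum):
    """Hypothetical stage labels mirroring the flow described above."""
    LURE = 1
    AI_AGENT_CALL = 2
    WHATSAPP_HANDOFF = 3
    CREDENTIAL_THEFT = 4


def follows_described_flow(stages):
    """Return True if the observed stages appear in the order of the described flow.

    Stages may be skipped, but they may not appear out of order.
    """
    values = [stage.value for stage in stages]
    return values == sorted(values)


def furthest_stage(stages):
    """Return the latest stage reached in an annotated interaction, or None if empty."""
    return max(stages, key=lambda stage: stage.value) if stages else None


# Toy annotation (assumed, not real data): the handoff stage was skipped.
observed = [AttackStage.LURE, AttackStage.AI_AGENT_CALL, AttackStage.CREDENTIAL_THEFT]
in_order = follows_described_flow(observed)
reached = furthest_stage(observed)
```

A label like `reached` would let a benchmark report how far an interaction progressed before a detector intervened.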

Reviewer's Comments

The project can be commended for identifying a real and significant threat and for providing sufficient evidence of its magnitude. I appreciated the explanation of the African context and the bullets describing how these attacks may take advantage of African citizens. The research gap was clear, and the challenge and proposed solution were well thought out. Section 2.3 highlighted a novel form of an existing challenge enabled by AI technologies. The author is also well aware of the limits of their plan to address the challenge and presents what appear to be workable solutions; I say "appear" because I do not have the technical competence to adequately assess whether they are indeed workable. Throughout, the author referenced existing evidence, showing that the question had been well considered. I was also impressed by the ethical safeguard considerations and the attention to applicable legislation in Zimbabwe and RSA. The balance between technical detail and explainability was also thoughtfully handled, making it a well-executed project.

One-line summary: A research proposal for the first African-context benchmark to detect AI voice-agent vishing (robocall to WhatsApp to mobile-money drain), built around four detection tasks, a planned 1,300-sample synthetic audio dataset, a live adversarial voice-agent test harness, and SynthID watermark evaluation. It is a proposal: nothing in it was actually built or run.

Constructive critique:

This is the sharpest problem framing I have seen so far in the batch. The threat is real and current, the regional angle is genuinely under-served (ASVspoof and similar do synthetic-voice detection in isolation, with none of the brand-impersonation, multi-channel handoff, or mobile-money context that defines the actual African attack), and the author clearly understands both the attack chain and the detection landscape. The task design is thoughtful, the label schema is concrete, the metrics are appropriate and threshold-aware, the dual-use section is responsible (no real-voice cloning, gated dataset, watermarked samples, held-out test labels), and the whole thing reads cleanly and professionally. The novelty is real: a first-of-kind African vishing benchmark with a live adaptive-adversary harness is a direction others could build on.

The problem is that it stops exactly where the value starts. This is a PRD for a benchmark, not a benchmark. Hamel Husain and Shreya Shankar make the point that an eval is what you learn from putting real outputs through it, and that you cannot know the failure modes until you look at real data. Brendan Foody frames evals as the new PRD, and that is precisely what this is, a very good PRD, with the actual eval still unwritten. In a weekend hackathon judged partly on execution, a zero-implementation proposal is a hard ceiling. Even a tiny slice run for real, say 50 synthetic clips through the SynthID detector plus one LLM-judge transcript pass, with the actual precision and recall reported, would have transformed this from "here is what I would measure" into "here is what I found."
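To make the "tiny slice" suggestion concrete, here is a minimal sketch of how precision and recall could be reported for a small labeled batch. The function name and the toy labels are assumptions for illustration; they are not measurements from the project, and no real detector is called.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) where True means 'synthetic' in both lists."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Toy data (assumed): ground-truth labels and a detector's yes/no outputs.
labels = [True, True, True, True, False, False]
predictions = [True, True, False, True, True, False]

precision, recall = precision_recall(labels, predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Even on 50 clips, reporting these two numbers alongside the raw counts would turn the proposal's metrics section into an observed result.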

One thing to fix directly, and it is important: Table 4 ("Expected Baseline Results") presents invented numbers (0.98 AUC, 0.92 overall) in the exact visual form of a measured results table. A reader skimming could easily mistake projections for findings. Relabel it unmistakably as targets, or better, replace it with whatever you can actually measure. The most concrete next step is also the cheapest: SynthID coverage is your single most interesting and most runnable research question (it only flags watermarks from participating generators, so open-source TTS slips through), and you could produce a real coverage-gap number this week without building the full pipeline. Minor but worth tightening: one reference is a placeholder (Joycent, arXiv:2405.XXXXX), which undercuts the otherwise strong citation work.
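A coverage-gap number of the kind suggested above could be computed along these lines. This is a sketch under stated assumptions: the generator names are placeholders, the flagged/unflagged values are invented toy data, and the step that actually runs a watermark detector on each clip is deliberately left out because its API is not described in the submission.

```python
from collections import defaultdict


def coverage_by_generator(results):
    """Compute the fraction of synthetic clips flagged, per generator.

    `results` is an iterable of (generator_name, was_flagged) pairs,
    containing synthetic clips only.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for generator, was_flagged in results:
        totals[generator] += 1
        flagged[generator] += int(was_flagged)
    return {name: flagged[name] / totals[name] for name in totals}


def overall_coverage(results):
    """Fraction of all synthetic clips that were flagged, across generators."""
    results = list(results)
    return sum(int(was_flagged) for _, was_flagged in results) / len(results)


# Toy data (assumed): placeholder generator names and invented outcomes.
results = [
    ("watermarking_tts", True),
    ("watermarking_tts", True),
    ("watermarking_tts", False),
    ("watermarking_tts", True),
    ("open_source_tts", False),
    ("open_source_tts", False),
]

per_generator = coverage_by_generator(results)
coverage_gap = 1.0 - overall_coverage(results)
```

Reporting coverage per generator, rather than a single pooled figure, keeps the participating and non-participating systems visibly separate.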

Track + flags:

On-topic and a clean fit for Africa, Evals & Benchmarks. Central threat claims verified as real. Dual-use is handled responsibly, though the live attack-harness recipe is the most sensitive element and should stay gated. Note for the panel: this is a proposal with no executed work, and the projected-results table is presented in a way that could read as measured data. No plagiarism signals; one placeholder citation.

The project is well-executed and addresses an important problem. However, there are two notable limitations:

- The submission is a proposal rather than an executed benchmark. No dataset was constructed, no baseline was trained, no SynthID coverage was measured, and no live voice-agent harness was built; the "expected baseline results" table is populated with projected numbers based on published ASVspoof performance. As a research proposal it is thorough, but there is no empirical contribution to evaluate — the core claims about detectability of ATHR-style attacks, the Gemini/Claude LLM-judge accuracy, and the SynthID gap are all forward-looking rather than tested.

- Several design choices are asserted rather than justified against alternatives. The 1,300-sample size, the specific accent-group split (300 American / 200 British / 200 African-English / 100 mixed), the four-task structure, and the weighted overall score formula (0.35·T1 + 0.25·T2 + 0.20·T3 + 0.20·T4) are presented without power analysis, ablation, or comparison to how existing anti-spoofing benchmarks partition their tasks. Without piloting even a small portion of the dataset, it is difficult to know whether the proposed splits would be sufficient to distinguish detector performance at meaningful confidence.
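As a rough illustration of the second point, the sketch below computes the weighted overall score from the formula quoted above and compares Wilson confidence intervals for two of the proposed accent-group sizes. The weights and group sizes come from the submission; the 90% accuracy figure is a hypothetical placeholder, and the Wilson interval is a swapped-in stand-in for the power analysis the submission lacks, not the author's method.

```python
import math

# Task weights as quoted from the submission's overall score formula.
WEIGHTS = {"T1": 0.35, "T2": 0.25, "T3": 0.20, "T4": 0.20}


def overall_score(task_scores):
    """Weighted sum 0.35*T1 + 0.25*T2 + 0.20*T3 + 0.20*T4."""
    return sum(WEIGHTS[task] * task_scores[task] for task in WEIGHTS)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical: a detector that is 90% accurate on two accent groups.
low_100, high_100 = wilson_interval(90, 100)   # the 100-sample mixed group
low_300, high_300 = wilson_interval(270, 300)  # the 300-sample American group

width_100 = high_100 - low_100
width_300 = high_300 - low_300
print(f"n=100 interval width: {width_100:.3f}, n=300 interval width: {width_300:.3f}")
```

If two detectors differ by less than the interval width on the smallest group, the proposed split would not separate them, which is the kind of check a small pilot would make possible.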

Cite this work

@misc{
  title={(HckPrj) AfriVish-Bench},
  author={Tinevimb musingadi},
  date={6/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
