Jun 22, 2026

Blindfold - Blindly Auditing the Vietnamese LLM Safety Blind Spot using Secured Enclaves

Khoa Duy Nguyen, Rasswanth S

Frontier labs mainly carry out red-team safety evaluations in English, on globally recognized harms. This leaves two

blind spots: local or regional harms, and a non-English refusal gap where a model refuses an English request but complies on the

identical one in another language — Deng et al. (ICLR 2024) measured ChatGPT unsafe 0.63% in English versus 7.94% in

Vietnamese on identical prompts. doing such evaluations on real frontier models requires three mutually distrustful parties: an AI Lab with private weights, an AI Safety org with a private benchmark, and an Auditor with eval code to cooperate without revealing their secrets to one another. We built (i) a blind-audit harness: a code-to-data flow on OpenMined’s syft-client where

weights, benchmark, and code meet inside a sealed enclave that both data owners review and approve, and only a signed scorecard

exits; and (ii) a 47-prompt bilingual (EN↔VN) local-harms benchmark, every harmful prompt citing a real Vietnamese source.

Running the harness across four models (qwen2.5-0.5b/3b, phogpt-4b, seallm-v3-7b), we found the Vietnamese-

specialized model the least safe in Vietnamese: it refused only 14% of harmful prompts in VN versus 38% in English — the

worst gap of any model — while safety otherwise tracked size and alignment, not language coverage. Auditing in English alone

would have over-rated exactly the model marketed for Vietnamese. All data and code are at github.com/khoaguin/blindfold.

Review Project

See Code

View Presentation

View Related Sprint

Reviewer's Comments

Very clear and good work

Blindfold ships a code-to-data audit harness that runs on OpenMined syft-client in two configurations, an in-memory demo and a GCP Confidential Space deployment with remote attestation. The benchmark covers 47 bilingual prompts grounded in Vietnamese government and Ministry of Health sources. The team measures four models (qwen2.5-0.5b, qwen2.5-3b, phogpt-4b, seallm-v3-7b) and surfaces a sharp finding, the Vietnamese-specialized phogpt-4b is the least safe in Vietnamese. The work takes on two coupled problems at once, the English-Vietnamese refusal gap in frontier LLMs and the mutual-distrust barrier that blocks shared evaluation on production weights.

A few things stood out on the contribution side. The locally authored scam and medical subsets cite real Vietnamese advisories (the AIS catalogue of 24 online fraud forms, đắp lá cancer warnings, fake VNeID app scripts), and these are prompts an English-translated benchmark cannot generate, which is what justifies the native_cultural vs translated harm-origin split the analysis depends on. The protocol also stays disciplined. The enclave emits only the raw output, the LLM judge runs off the private boundary so no Anthropic key enters the sealed environment, and the keyword fallback together with benign over-refusal controls caught the qwen2.5-0.5b artifact where a negative gap was actually indiscriminate refusal. The headline finding (phogpt-4b refuses 14 percent of harmful VN prompts vs 38 percent in English, the worst gap of the four models) is interpretable, actionable for the open-model ecosystem, and exactly the result an English-only audit would miss.

A few directions would strengthen the work. The attestation story could be tightened in the report itself. Section 5.4 references the Google-signed JWT, secure-boot, debug-disabled, and the immutability filter, and the paper would benefit from showing the attestation claim set the data owners actually verify, the failure modes if the JWT signer key rotates or the launcher image hash diverges, and how the OAuth-token release path on Secret Manager binds to the attested image and not to the service account alone. One thing worth flagging is the side-channel and inference-code exfiltration vector listed as out of scope. Even a one-paragraph threat model on what a malicious researcher payload could leak through timing, output length, or controlled-token responses would let a regulated buyer (a central bank auditor, say) judge residual risk. The 47-prompt, single-seed, greedy-decode protocol stays honest about being directional, and a next-step table with target seed counts, sample size per category, and a planned inter-rater agreement check on the LLM judge would convert the directional claim into something a downstream lab can replicate.

For AI safety this is a useful template. It operationalizes trust-minimized evaluation in a way that matches how regulated industries already think about audit (sealed compute, attested code, single signed artifact out). It also shows that language-specialization without aligned safety data is a measurable hazard, not a hypothetical one.

One of the best entries. It lets a lab, a safety team, and a checker who don't trust each other run the same test without showing each other their private data, and it adds a set of Vietnamese harmful prompts to test with. The main finding stands out: the model made for Vietnamese is the least safe in Vietnamese — a problem an English-only test would never catch. As you note, the numbers are early: one run, a small setup, 47 prompts. Next: run it more times, use a checked way to score answers, and do one full run on the real secure system. Also fix a few typos.

Blindfold is architecturally the most original submission in this pool. The three-party blind-audit harness — where model weights, benchmark, and eval code meet only inside a sealed hardware enclave and only a signed scorecard exits — addresses a real structural problem in independent AI auditing. The local-harms benchmark grounded in Vietnamese government sources (AIS fraud catalogue, Ministry of Health misinformation warnings) is exactly what the Global South track asked for, and the PhoGPT finding is genuinely striking: the Vietnamese-specialized model is the least safe in Vietnamese, and an English-only audit would have missed it entirely.

Scale the benchmark and run multiple seeds. 42 harmful prompts across two local categories, one run, greedy decode is directional evidence — not a citable finding. Even doubling to ~100 locally-authored prompts with three seeds and reported variance would significantly harden the PhoGPT result. The architecture is ready for this; the bottleneck is benchmark scale.

Replace heuristic refusal scoring with a validated judge. The paper notes an optional LLM judge is available but uses heuristic scoring for all reported numbers. Reporting inter-rater reliability between the heuristic scorer and the LLM judge — even on a subset — would substantially increase confidence in the gap numbers.

Foreground the architecture contribution. The paper's most novel contribution is the blind-audit harness, but the abstract and structure lead with the PhoGPT measurement. Positioning the architecture as the primary contribution — with the measurement as proof-of-concept — would better reflect where the real impact potential lies.

Cite this work

@misc {

title={

(HckPrj) Blindfold - Blindly Auditing the Vietnamese LLM Safety Blind Spot using Secured Enclaves

author={

Khoa Duy Nguyen, Rasswanth S

date={

6/22/26

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

View All

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026