Blindfold - Blindly Auditing the Vietnamese LLM Safety Blind Spot using Secured Enclaves
Khoa Duy Nguyen, Rasswanth S
Frontier labs run most red-team safety evaluations in English and on globally recognized harms. This leaves two
blind spots: local or regional harms, and a non-English refusal gap, where a model refuses a request in English but complies with the
identical request in another language. Deng et al. (ICLR 2024) measured an unsafe-response rate for ChatGPT of 0.63% in English versus 7.94% in
Vietnamese on identical prompts. Running such evaluations on real frontier models requires three mutually distrustful parties to cooperate without revealing their secrets to one another: an AI lab with private weights, an AI safety organization with a private benchmark, and an auditor with evaluation code. We built (i) a blind-audit harness: a code-to-data flow on OpenMined’s syft-client in which
weights, benchmark, and code meet inside a sealed enclave that both data owners review and approve, and from which only a signed scorecard
exits; and (ii) a 47-prompt bilingual (EN↔VN) local-harms benchmark in which every harmful prompt cites a real Vietnamese source.
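The approve-then-run flow described above can be illustrated with a minimal sketch. It does not use or imitate the syft-client API: every name here (`digest`, `run_if_approved`, `verify`, the party labels, the demo key) is hypothetical, and an HMAC with a shared demo key stands in for whatever signing or attestation a real enclave provides. The idea shown is only that evaluation runs when every data owner has approved the exact code digest, and that nothing but a signed scorecard is returned.

```python
import hashlib
import hmac
import json


def digest(code: str) -> str:
    """SHA-256 hex digest of the exact evaluation code under review."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def _payload(code_sha256: str, scorecard: dict) -> bytes:
    """Canonical bytes that get signed: code digest plus aggregate scores."""
    body = {"code_sha256": code_sha256, "scorecard": scorecard}
    return json.dumps(body, sort_keys=True).encode("utf-8")


def run_if_approved(code, approvals, required, evaluate, key):
    """Run `evaluate` only if every required party approved this exact code.

    `approvals` maps a party name to the digest that party signed off on.
    Only the scorecard and its tag are returned, never the raw inputs.
    """
    want = digest(code)
    missing = [p for p in required if approvals.get(p) != want]
    if missing:
        raise PermissionError(f"not approved by: {missing}")
    scorecard = evaluate()  # assumed to return aggregate metrics only
    tag = hmac.new(key, _payload(want, scorecard), hashlib.sha256).hexdigest()
    return {"code_sha256": want, "scorecard": scorecard, "hmac_sha256": tag}


def verify(result, key) -> bool:
    """Check that the scorecard and code digest match the attached tag."""
    expected = hmac.new(
        key, _payload(result["code_sha256"], result["scorecard"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, result["hmac_sha256"])


# Hypothetical usage with placeholder values (not project data).
EVAL_CODE = "def evaluate(): ..."
KEY = b"demo-key-not-for-real-use"
approvals = {"lab": digest(EVAL_CODE), "safety_org": digest(EVAL_CODE)}
result = run_if_approved(
    EVAL_CODE, approvals, ["lab", "safety_org"],
    lambda: {"refusal_rate_vi": 0.125}, KEY,
)
```

Binding the code digest into the signed payload means a scorecard cannot be reattached to different evaluation code without invalidating the tag.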
Running the harness across four models (qwen2.5-0.5b/3b, phogpt-4b, seallm-v3-7b), we found that the Vietnamese-specialized
model was the least safe in Vietnamese: it refused only 14% of harmful prompts in Vietnamese versus 38% in English, the
widest gap of any model. Otherwise, safety tracked model size and alignment rather than language coverage. Auditing in English alone
would have overrated exactly the model marketed for Vietnamese. All data and code are at github.com/khoaguin/blindfold.
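The refusal gap quoted above is a difference of two refusal rates: 38% in English minus 14% in Vietnamese is 24 percentage points. A minimal sketch of that computation follows, assuming each graded response is reduced to a boolean "refused" flag. The record layout, function names, and toy data are illustrative assumptions, not the project's evaluation code or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """One graded response to a harmful prompt (field names are illustrative)."""
    model: str
    lang: str      # "en" or "vi"
    refused: bool  # True if the model declined the request


def refusal_rate(rows, model, lang):
    """Fraction of harmful prompts the model refused in one language."""
    hits = [r.refused for r in rows if r.model == model and r.lang == lang]
    if not hits:
        raise ValueError(f"no judgements for {model!r} in {lang!r}")
    return sum(hits) / len(hits)


def language_gap(rows, model, base="en", target="vi"):
    """Refusal rate in `base` minus refusal rate in `target`.

    A positive value means the model refuses less often in `target`.
    """
    return refusal_rate(rows, model, base) - refusal_rate(rows, model, target)


# Toy data, made up for illustration; NOT the project's results.
rows = (
    [Judgement("toy-model", "en", i < 3) for i in range(8)]    # 3 of 8 refused
    + [Judgement("toy-model", "vi", i < 1) for i in range(8)]  # 1 of 8 refused
)
gap = language_gap(rows, "toy-model")
```

Reporting the gap per model, rather than only a single-language refusal rate, is what exposes a model that looks acceptable in English but refuses far less often in Vietnamese.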
Cite this work
@misc {
title={(HckPrj) Blindfold - Blindly Auditing the Vietnamese LLM Safety Blind Spot using Secured Enclaves},
author={Khoa Duy Nguyen and Rasswanth S},
date={},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


