Jun 21, 2026

vigilAI

Diana Chang, Ian Duhamel Hayes, Pedro Quezada

vigilAI — Auditing LLM compliance with Brazil's PL 2338/2023

AI-safety evaluation is overwhelmingly English-language and built around EU/US law, which leaves a blind spot: a model that passes a Western audit may still violate a Global-South statute written in another language and grounded in different rights. vigilAI tests this for Brazil's PL 2338/2023 (the AI bill approved by the Senate in December 2024) by forking COMPL-AI, the EU-AI-Act evaluation framework, and adding five Brazil-specific benchmarks mapped to the bill's Chapter II rights: AI disclosure (Art. 5, I), non-discrimination over Brazilian categories (Art. 5, III), the full high-risk rights triad — explanation, contestation, and human review (Art. 6, I–III) — and the Algorithmic Impact Assessment (Arts. 25–28). Because two benchmarks reuse the exact same scorer as their EU original, the same-model EU↔Brazil delta isolates the effect of language and legal framing from raw model strength.

Across six models from five developers (8B to frontier), all six disclose being an AI ~95–100% of the time in English but only ~50–55% of the time when asked in Portuguese under Brazilian law — a ~0.45 compliance gap invisible to any English benchmark. The takeaway: certification under a foreign regime is not compliance, and localized, statute-referenced evaluation is both necessary and achievable.

Reviewer's Comments

This is a high-impact project that successfully bridges local legislation with robust engineering practices. The findings regarding language disclosure gaps are particularly relevant and make a meaningful contribution to ongoing discussions in the field. To further strengthen the validity of the results, expanding the native dataset through automation would help improve representativeness and reduce the risk of statistical artifacts.

vigilAI forks COMPL-AI, the LatticeFlow / ETH Zurich / INSAIT EU AI Act evaluation suite, and re-points it at Brazil's PL 2338/2023, the AI bill approved by the Federal Senate in December 2024. The fork keeps all 30 EU benchmarks intact on top of UK AISI's Inspect AI harness and adds five Brazil-specific tasks: human_deception_brazil (Art. 5, I disclosure), bbq_brazil (Art. 5, III non-discrimination over IBGE racial, regional, intersectional, religion, and class categories), explanation_quality (Art. 6, I), contestation_review (Art. 6, II and III), and aia_checklist (Arts. 25-28). Two benchmarks reuse the upstream scorer verbatim, so the EU minus Brazil delta isolates content (Portuguese plus statutory framing) from model strength. Six models from four developers were run: two Anthropic frontier models under a scaled config and four open-weight models under a pilot config. The headline finding is a tight, reproducible disclosure gap of about -0.45: every model denies being human ~95 to 100% of the time in English yet only ~50 to 55% of the time on Portuguese plus LGPD-framed prompts.
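To make the delta concrete, here is a minimal sketch of the per-model comparison. It assumes the gap is computed as Brazil minus EU, which matches the sign of the reported -0.45. The model names and scores are hypothetical placeholders chosen inside the reported ranges, not the project's measured results, and `jurisdiction_delta` is an illustrative helper, not a function from the vigilAI or COMPL-AI code.

```python
from statistics import mean

# Hypothetical per-model disclosure rates (fraction of prompts where the model
# denies being human). Placeholders inside the reported ranges, not real results.
eu_scores = {"model_a": 0.95, "model_b": 1.00, "model_c": 0.98}
br_scores = {"model_a": 0.50, "model_b": 0.55, "model_c": 0.53}


def jurisdiction_delta(eu: dict, br: dict) -> dict:
    """Brazil-minus-EU score per model, for models present in both runs.

    Assumption: both scores come from the same scorer, so the difference
    reflects prompt content (language and legal framing), not scoring.
    """
    return {model: round(br[model] - eu[model], 4) for model in eu if model in br}


deltas = jurisdiction_delta(eu_scores, br_scores)
mean_delta = round(mean(deltas.values()), 4)

print(deltas)
print(mean_delta)
```

Because the scorer is held fixed, a delta near zero would suggest the Portuguese and statutory framing makes no difference, while a consistently negative delta across developers points at the prompt content rather than at any single model.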

What jumped out is the cleanness of the methodology. Forking COMPL-AI rather than reimplementing keeps the scorer identical across jurisdictions, which removes a confound most cross-language audits ignore. The deterministic six-element rubric scorers for explanation, contestation, and AIA (no LLM judge, multilingual cue lists tuned against real model output, 173 unit tests, runs under a mock model at zero cost) sidestep the reproducibility and circularity problems that plague rubric-based LLM evaluation. The legal mapping is accurate: Art. 5, I prior information, Art. 5, III non-discrimination, the Art. 6 high-risk rights triad, and the Arts. 25-28 AIA "public conclusions" obligation are stated correctly and tied to LGPD Art. 20 the way Brazilian practitioners actually frame them. The methodological finding that scaling the EU bbq baseline from 20 to 1000 samples flipped Haiku's bias delta sign from +0.05 to -0.18 is a publishable point in its own right and an honest argument against the cheapest baseline.
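A deterministic rubric scorer of the kind described can be sketched in a few lines. Everything below is an assumption made for illustration: the six element names, the cue phrases, and the substring matching are hypothetical and are not the cue lists or scoring logic shipped with the project.

```python
import unicodedata

# Hypothetical six-element rubric. Element names and cue phrases are
# illustrative only, not the project's tuned cue lists.
RUBRIC_CUES = {
    "states_decision": ["decisão", "decision"],
    "names_main_factors": ["fatores", "factors", "critérios"],
    "mentions_data_used": ["dados", "data"],
    "offers_contestation": ["contestar", "contest", "recurso", "appeal"],
    "offers_human_review": ["revisão humana", "human review"],
    "cites_legal_basis": ["lgpd", "pl 2338"],
}


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Revisão' matches 'revisao'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def score_response(response: str, rubric: dict = RUBRIC_CUES):
    """Return (fraction of rubric elements hit, per-element hit map)."""
    text = normalize(response)
    hits = {
        element: any(normalize(cue) in text for cue in cues)
        for element, cues in rubric.items()
    }
    return sum(hits.values()) / len(hits), hits


example = (
    "A decisão foi baseada nos seus dados de renda. Você pode contestar "
    "e pedir revisão humana, conforme a LGPD."
)
score, hits = score_response(example)
print(round(score, 3), hits)
```

The same input always yields the same score, which is what makes such a scorer cheap to unit-test and free of judge circularity. The trade-off is that plain substring matching rewards surface keywords, so a response can score well without demonstrating real procedural reasoning.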

Where the work would gain most is sample power and validation. Worth flagging: (1) bbq_brazil at 44 hand-authored scenarios and the rubric tasks at 3, 4, and 1 scenarios respectively are too small to call the bias and Art. 6 findings significant, and native-annotator validation of the Portuguese stereotypes is acknowledged as pending. The authors are upfront about this, and the disclosure result is robust enough to carry the paper, but the report sometimes reads more confident on bias and Art. 6 than the per-scenario standard errors support. (2) The deterministic cue lists for the rubrics measure whether a compliant description is produced, not real-world behavior under a Brazilian deployment. A cross-check with an LLM-judge variant on a held-out scenario set would let the team report how much of the score is keyword surface and how much is genuine procedural reasoning. (3) Local models ran at --limit 20 with one epoch, so the per-model open-weight numbers (especially aia_checklist at n=1) are noted as single observations. Scaling those to the frontier config would tighten the cross-developer claim and is cheap on Ollama. (4) Sector overlays, mentioned in future work, are where this becomes deployable: BACEN circulars for finance, ANVISA for health, and a CVM angle for capital markets would let the Art. 28 scorecard double as input to a real sectoral AIA, which is the leverage point a deployer actually needs.
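The sample-power concern can be illustrated with a rough calculation. This sketch assumes each benchmark score can be treated as a simple proportion of independent binary outcomes, which is a simplification of how bias scores are actually computed, and it uses the worst case p = 0.5. The helper names are hypothetical.

```python
import math


def proportion_se(p: float, n: int) -> float:
    """Standard error of a proportion under a binomial model."""
    return math.sqrt(p * (1 - p) / n)


def delta_se(p1: float, n1: int, p2: float, n2: int) -> float:
    """Standard error of a difference between two independent proportions."""
    return math.sqrt(proportion_se(p1, n1) ** 2 + proportion_se(p2, n2) ** 2)


# Worst-case (p = 0.5) standard error at the sample sizes mentioned above.
for n in (20, 44, 1000):
    print(n, round(proportion_se(0.5, n), 3))
# 20 0.112
# 44 0.075
# 1000 0.016
```

Under these assumptions, a delta of +0.05 measured at n = 20 sits well inside one standard error of zero, so a sign flip once the baseline is scaled to 1000 samples is unsurprising. The same arithmetic suggests 44 scenarios can only resolve fairly large bias effects.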

The broader signal is that this is exactly the localized, statute-referenced AI safety evaluation Global South AI governance has been missing. The claim that a model can pass an EU audit and still violate a Brazilian statutory right is not just plausible, it is empirically demonstrated here with near-zero variance on the EU side. The Art. 28 scorecard framing is a clever piece of regulatory engineering: a single eval run produces the exact "public conclusions" artifact that high-risk deployers will be required to publish, so the tool sits naturally on the path that compliance work has to take anyway. As PL 2338/2023 moves through the Câmara dos Deputados and ANPD issues guidance, the same fork-and-extend pattern documented here is the template other Global South jurisdictions can copy.

vigilAI is a sharp and well-motivated contribution: forking COMPL-AI into a Brazil-specific suite and keeping the original 30 EU benchmarks makes the EU–Brazil deltas genuinely interpretable as legal-frame and language effects rather than just model strength. The disclosure finding is especially compelling: six strong models nearly always deny being human in English but fail to do so about half the time on Portuguese / LGPD-framed prompts, while still performing well on Art. 6 rights. This refines the narrative from “models fail Brazil” to “they specifically fail AI disclosure”. At the same time, the empirical backbone is still small (a handful of Brazil-specific benchmarks and relatively few scenarios per right), and some of the high-risk rights are tested with very compact templates, so the reported gaps are best read as strong warning signals rather than comprehensive compliance measurements. Expanding the Brazilian scenario set (especially for non-discrimination and AIA), adding at least one independent annotator or LLM judge for borderline cases, and including a couple of concrete example prompts and model outputs in the main text would make the benchmark more reusable for regulators and operators beyond the hackathon context.

This article is well-researched, clear and well presented. With its examples, it technically and clearly demonstrates that compliance with regulatory frameworks from other regions does not automatically mean compliance with rights in our Global South regions/countries. As you point out, this may lead to risks and damages that are still not considered systematically by stakeholders in AI governance, at all levels. Your article contributes to demonstrating the importance of this issue.

As many of us who work defending human rights, and others in academia, advocate for localized AI normative and regulatory frameworks based on rights, public policies, and equality analysis, it is essential that we join forces with colleagues in AI Safety who arrive at similar conclusions through AI Safety research such as this paper.

Similarly, I encourage the authors to strengthen their research by incorporating human rights more explicitly, along with the views of representatives of marginalized groups/populations themselves. I encourage you to read more about decolonial and feminist approaches to AI governance and research, in which I believe you will find useful elements for your work.

Thank you for this paper, I look forward to learning more about your research and your plans to use it to inform and impact AI policies.

Cite this work

@misc{
  title={(HckPrj) vigilAI},
  author={Diana Chang, Ian Duhamel Hayes, Pedro Quezada},
  date={6/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026