Jun 22, 2026

Three Detectors, Three Failure Profiles: Detector-Specific Demographic Bias in Public Deepfake Detection and Its Implications for Content Moderation in Latin America

Jose Ernesto Ramirez Garcia, Angel Pineda Aguirre, Daniel Alberto Curiel Vargas

This project audits whether three public Hugging Face deepfake detectors protect demographic groups equally in content moderation contexts relevant to Latin America and Mexico’s Ley Olimpia. It measures false-negative rates, meaning synthetic faces classified as real, using real FFHQ/Flickr faces and synthetic StyleGAN faces from the 140k Real and Fake Faces dataset. Apparent demographics were labeled with FairFace. The main finding is that bias is detector-specific: the highest false-negative group changes by model—Black for prithiv v1, Asian for dima806, and White for prithiv v2. Only dima806 shows a statistically significant racial difference, while prithiv v1 shows a significant gender difference. The audit also finds that strong advertised model metrics do not transfer reliably: dima806 and prithiv v2 perform near chance at the default threshold. The project concludes that detector fairness must be audited per deployment using disaggregated error rates and tested under real operating conditions before use as protection tools.
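As a rough illustration of what a disaggregated false-negative rate means here, the following minimal sketch computes the FNR per apparent demographic group. It is not the project's code: the function name `fnr_by_group`, the record layout, and the toy records are all assumptions made for illustration, and the numbers are invented.

```python
from collections import defaultdict

def fnr_by_group(records):
    """Return {group: false-negative rate}.

    records: iterable of (group, is_synthetic, predicted_synthetic) tuples.
    This layout is hypothetical, chosen only for this sketch.
    """
    misses = defaultdict(int)
    totals = defaultdict(int)
    for group, is_synthetic, predicted_synthetic in records:
        if not is_synthetic:
            continue  # FNR is defined over synthetic images only
        totals[group] += 1
        if not predicted_synthetic:
            misses[group] += 1  # synthetic face classified as real
    return {group: misses[group] / totals[group] for group in totals}

# Invented toy records, not data from the audit.
toy_records = [
    ("A", True, True),
    ("A", True, False),
    ("A", False, False),  # real image, ignored by the FNR
    ("B", True, False),
    ("B", True, True),
    ("B", True, True),
    ("B", True, True),
]

rates = fnr_by_group(toy_records)
print(rates)
```

An aggregate FNR over all seven toy records would hide the gap between the two groups, which is the kind of masking the audit is concerned with.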


Reviewer's Comments

This study tackles an urgent social challenge for content moderation in Latin America and convincingly demonstrates how aggregated metrics can mask important demographic disparities. Despite computational limitations, the project establishes a strong foundation for future work. Replicating the analysis using techniques that more closely resemble real-world threat scenarios would further enhance the practical relevance of the findings.

This paper aims to contribute to the prevention of, and response to, one of the AI-related harms that disproportionately affects the most marginalized women: deepfakes of non-consensual sexual imagery. While this is very important work, as you move forward in your research, my main suggestion is to bring social scientists and human rights experts or defenders into your team, and/or to link with Mexican organizations already working on the response to this type of gender-based violence. This will both strengthen your analytical work and help ensure that your research is useful to those already working in this area.

While the methodology of the paper is technically solid, I have two concerns:

* First, the need to problematize the concept of "race" and differentiate it from ethnicity,

* Second, the fact that the categories that are used as classifications (Middle Eastern, Indian, Latino...) were not developed in Mexico or Latin America, and therefore do not reflect the diversity of our countries in a way that is significant for us.

For instance, in Table 1: "Latino" may be an ethnic group in the US, but what does it mean in Latin America, where we have people of many different characteristics and identities? Which of the categories included there are supposed to represent the Indigenous populations of our countries? Even for the US, the category "Latino" may not be the most appropriate for the purpose of face identification. Similarly, what do you mean by Western features? Probably "whiteness". While the work already presented in your paper shows important results, this theoretical framework would benefit greatly from being revised as you move forward, including by reviewing research and census data produced in our own countries.

I also suggest you review the work carried out by agencies such as UNFPA in order to incorporate international standards and concepts, for instance "technology-facilitated gender-based violence" and "non-consensual sexual imagery using deepfakes".

I congratulate you for investing your time and effort in this topic that affects so many girls and women around the world. Working alongside experts in gender-based violence and social studies will make your research more solid and with more potential to have an impact that is much needed in Mexico and elsewhere.

What makes this research so crucial is how it dismantles the myth of a "universal" safety standard. It proves that we cannot blindly import black-box moderation models and expect them to protect all demographic groups equally. When a victim of non-consensual intimate imagery seeks recourse, their fundamental right to protection should not depend on the specific, un-audited biases of the detector a platform happens to use. The main findings solidify a critical governance mandate: algorithmic fairness cannot be assumed or inherited. For platforms operating in the Global South, relying on standardized, Western-centric safety benchmarks is insufficient and negligent. We need mandatory, deployment-specific auditing for any system acting as a gatekeeper for content moderation and victim protection. This is exactly the kind of concrete, disaggregated evidence required to shift AI policy from theoretical ethics to enforceable liability.

On impact, I would say that, although the problem is well defined, its scope is quite limited relative to the broader set of AI safety challenges, though it is a fair one nonetheless. The technique the article uses to address the problem, an evaluation of the false-negative rates of some public deepfake detectors, is not innovative enough for one to see how it would transfer to these larger problems.

On execution, taking the paper on its own terms, I would say that the finding the authors do not treat as central is in fact the central and more reliable one: the detectors' unreliability under distribution shift. The models advertise general performance metrics that do not hold in the specific application of detecting fake faces, without disclosing exactly what they mean by detecting fake faces. The model authors could defend themselves by saying that they are actually detecting in-the-wild face forgeries, but they do not specify this claim; instead they make a general one that does not hold in a specific application. In this sense, the paper finds something that is true. However, what the paper claims to be central is, I would argue, less central, given the finding that the detectors are already generally unreliable in their false-negative rates as deployed at the default threshold. Conditioned on this finding, the main conclusion is that one should be cautious about using these detectors in production at all, particularly for fake face detection. It would be different if, say, a detector showed a reliable FNR overall and for a privileged group but failed to generalize to other groups.

On presentation, I felt some context was missing that would either strengthen or weaken the claim of relevance, in particular some research on the real-world deployment of these deepfake detectors. Ideally, these detectors should be one part of a whole pipeline, and it would be interesting to know how they are applied in practice. I know this may be somewhat outside the scope of the presentation, but it helps a lot to know the context in which they are applied in the real world. On the technical side, I felt strongly that what was missing was per-group confidence intervals, not only the global ones. This is because a key proposal of the paper is to rank groups by their FNR, yet the authors themselves admit that some group counts are too small to be reliable. For instance, for prithiv v1, the group with the highest reported FNR (Black) has only n=12 samples, which the authors themselves flag as unreliable; a per-group confidence interval here would likely overlap most other groups and undercut the ranking the paper builds on.
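To make the per-group interval point concrete, here is a minimal sketch using the Wilson score interval for a binomial proportion, one common choice for small samples. It is not taken from the paper: the function name `wilson_interval` is an assumption, and the miss counts below are invented. Only the group size n=12 comes from the review text; the larger group of 1000 is hypothetical.

```python
import math

def wilson_interval(misses, n, z=1.96):
    """Wilson score interval for a proportion misses/n (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = misses / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical counts: 6 misses out of 12 synthetic faces in a small group,
# 400 misses out of 1000 in a large group. Neither is a result from the paper.
small_lo, small_hi = wilson_interval(6, 12)
large_lo, large_hi = wilson_interval(400, 1000)

print(f"small group FNR 0.50, interval ({small_lo:.3f}, {small_hi:.3f})")
print(f"large group FNR 0.40, interval ({large_lo:.3f}, {large_hi:.3f})")

# Intervals overlap when each lower bound sits below the other's upper bound.
overlap = small_lo < large_hi and large_lo < small_hi
print("intervals overlap:", overlap)
```

With these invented counts, the small group's interval is far wider than the large group's and the two overlap, so a ranking based on the point estimates alone would not be supported.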

In saying that the problem is small in scope, I do not mean to discourage the applicants, but rather to fuel their ambition to tackle grander problems that may loom over their entire country and the Global South, such as disempowerment, loss of control, catastrophic misuse, and so on. I raise the scope point precisely because I take this team seriously and would hold them to the same bar as any researchers working on frontier problems. That being said, I will understand if the authors do not agree with my critique and consider these "small" problems important to work on either way.

Cite this work

@misc{
  title={(HckPrj) Three Detectors, Three Failure Profiles: Detector-Specific Demographic Bias in Public Deepfake Detection and Its Implications for Content Moderation in Latin America},
  author={Jose Ernesto Ramirez Garcia, Angel Pineda Aguirre, Daniel Alberto Curiel Vargas},
  date={6/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
