Apr 26, 2026
Quantifying the Reconstruction Gap: A Dataset Bottleneck Analysis Framework for AI-Era Biosecurity Screening
Finomo Awajiogak Orom
Dataset Bottleneck Analysis (DBA) — Project Summary
Biosecurity screening removes dangerous biological sequences from public
databases, but a critical question remains unanswered: does removing those
sequences actually prevent an AI-equipped adversary from reconstructing them
using what remains? DBA is an open-source framework that answers this
question empirically.
We introduce a Redundancy Score (R ∈ [0, 1]) that measures how much of a
restricted sequence set can be reconstructed from the public corpus. Applied
to 4,844 real UniProt Swiss-Prot proteins with a cluster-aware split, DBA
reveals a striking result: while BLAST-style k-mer screening achieves
R = 0.064 (0% of sequences recoverable at ≥ 0.90 similarity), ESM-2 protein
language model embeddings achieve R = 0.847 — 13.2× higher — with 95.5% of
restricted sequences recoverable at the same threshold. This is the AI
threat multiplier: the factor by which language-model-aided adversaries
exceed the reconstruction potential assumed by sequence-identity policy.
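The summary does not spell out how R is computed, so the following is only a minimal sketch under one plausible reading: R is the mean, over restricted sequences, of the best similarity to any public sequence, and coverage is the fraction of restricted sequences whose best match reaches the threshold. The function names, the use of cosine similarity, and the clamping to [0, 1] are all assumptions made for illustration, not the DBA implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def redundancy_score(restricted, public, threshold=0.90):
    """Hypothetical R: mean best-match similarity of restricted items to the public set.

    Returns (r_score, coverage), where coverage is the fraction of restricted
    items whose best public match is at or above `threshold`.
    Negative similarities are clamped to 0 so that R stays in [0, 1] (assumption).
    """
    best = [max(0.0, max(cosine(r, p) for p in public)) for r in restricted]
    r_score = sum(best) / len(best)
    coverage = sum(b >= threshold for b in best) / len(best)
    return r_score, coverage

# Toy vectors standing in for embeddings; real ESM-2 embeddings are much larger.
restricted = [[1.0, 0.0], [0.0, 1.0]]
public = [[1.0, 0.0], [1.0, 1.0]]
r_score, coverage = redundancy_score(restricted, public)
print(round(r_score, 3), coverage)
```

In this toy case one restricted vector has an exact public match and the other only a partial one, so coverage at 0.90 is half of the set even though the mean score is fairly high. That gap between R and coverage is why the summary reports both numbers.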
The most alarming finding is the toxin experiment. K-mer screening makes
toxin proteins appear 64% safer than average (R = 0.023 against 0.064),
creating a false sense of security. ESM-2 reveals the opposite: for toxins,
R = 0.873 (98.6% coverage), exceeding random proteins (0.847) and exposing a
32× gap between what sequence-identity screening assumes and what a language
model adversary can actually recover.
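For contrast, the k-mer baseline can be pictured as a set-overlap measure on short subsequences. The sketch below uses Jaccard similarity over 3-mers. The choice of k, the Jaccard formula, and the function names are assumptions for illustration, since the summary only describes the baseline as "BLAST-style k-mer screening".

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 if either set is empty)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

# Toy peptide strings, not real restricted sequences.
print(kmer_similarity("MKTAYIAK", "MKTAYIAK"))  # identical sequences
print(kmer_similarity("MKTA", "MKTC"))          # one shared 3-mer out of three
```

A measure like this drops quickly once the residues diverge, even when two proteins remain close in embedding space, which is the kind of gap the two R values are meant to expose.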
DBA runs end-to-end in under 22 minutes on a laptop CPU with no GPU
required. It is designed as a pre-deployment audit tool for screening
programme designers: run it on your proposed screening category before
setting thresholds, or you may be calibrating against the wrong adversary.
The core finding is genuinely striking and easy to grasp: current screening doesn’t just underperform against AI-equipped adversaries, it could actively mislead. The framework is lightweight enough to actually get used. The weakness is that the central claim, that sequences scoring above 0.90 similarity in embedding space are recoverable, is asserted rather than demonstrated. That equivalence is doing a lot of work, and it’s not obvious that it holds for the properties that actually matter in a biosecurity context. The experiments also run on generic protein databases rather than the sequences that real screening programmes actually restrict, so the jump to a policy recommendation is a bigger leap than the paper acknowledges. One concrete fix would be to show that high ESM-2 similarity actually predicts functional equivalence for at least one relevant property, whether that’s toxicity, receptor binding, or whatever else is available. Without that, the policy recommendation sits on an assumption.
Very interesting approach. The intersection of new protein language models and other biodesign tools with existing screening controls has not received much prior attention, to my knowledge. As the researchers reveal, this is an oversight, because existing screening selection and calibration tools might give a misleading threat picture when considered in the context of new biodesign tools like ESM-2. This is an important vulnerability, and the recommendation to use ESM-2 over k-mer is novel and valuable. Perhaps even more valuable is the R measure, which can be reapplied as new AI tools are released in order to update screening selection. The project is well thought out and well executed. One thing that would make the presentation a little stronger would be to make the connection between the R measure and the recommendations clearer (especially for non-bioinformatics people like this reviewer). The authors might also provide general guidelines for applying this measure in future. Overall, an excellent project and a valuable contribution to AI security.
Cite this work
@misc{
  title={(HckPrj) Quantifying the Reconstruction Gap: A Dataset Bottleneck Analysis Framework for AI-Era Biosecurity Screening},
  author={Finomo Awajiogak Orom},
  date={4/26/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


