Apr 24, 2026

The Protein ID Card: A Semantically Aware Screening Framework for Pathogenic Sequence Detection Using ESM-2 and FAISS

Guillaume Zahnd

The emergence of AI-powered biological design tools has necessitated a shift in biosecurity from sequence-alignment methods to function-prediction-based analysis. Current DNA screening protocols relying on BLAST are increasingly vulnerable to de novo designed sequences that evade similarity thresholds while retaining pathogenic functionality. We propose a scalable biosecurity screening pipeline that uses the ESM-2 transformer architecture to extract deep biological features from viral sequences, followed by similarity retrieval using FAISS. We implement a multi-class classification scheme to generate a functional "ID card" for proteins across five axes: Baltimore classification, Molecular function, Host category, Cellular tropism, and Zoonotic potential. Evaluated on 15,154 viral samples from UniProt, our approach achieves an aggregate F1-score of 0.89, compared with 0.77 for BLAST. The embedding-based pipeline demonstrates significant performance gains in complex domains such as Host category (0.94) and Cellular tropism (0.98), where sequence identity often fails to reflect biological roles. These results indicate that high-dimensional embeddings successfully capture the structural and functional constraints of viral evolution, providing a robust, semantically aware guardrail for modern biosecurity.
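As a rough illustration of the retrieval-then-label idea described in the abstract, the sketch below assigns each axis of an "ID card" by majority vote over the nearest indexed embeddings. It is a toy under stated assumptions, not the author's implementation: the vectors are hand-made stand-ins for ESM-2 embeddings, brute-force cosine search stands in for FAISS, and the majority-vote rule, the function names (`cosine`, `predict_id_card`), and the axis labels are all hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_id_card(query, index, k=3):
    """Majority-vote each axis over the k most similar indexed embeddings.

    `index` is a list of (embedding, labels) pairs, where `labels` maps an
    axis name to a class. Brute-force search stands in for FAISS here
    (assumption: the real pipeline may aggregate neighbours differently).
    """
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    neighbours = [labels for _, labels in ranked[:k]]
    return {
        axis: Counter(n[axis] for n in neighbours).most_common(1)[0][0]
        for axis in neighbours[0]
    }

# Toy 3-dimensional "embeddings"; real ESM-2 vectors are far larger.
toy_index = [
    ([0.9, 0.1, 0.0], {"host": "vertebrate", "zoonotic": "yes"}),
    ([0.8, 0.2, 0.1], {"host": "vertebrate", "zoonotic": "no"}),
    ([0.7, 0.3, 0.0], {"host": "vertebrate", "zoonotic": "yes"}),
    ([0.0, 0.1, 0.9], {"host": "plant", "zoonotic": "no"}),
    ([0.1, 0.0, 0.8], {"host": "plant", "zoonotic": "no"}),
]

card = predict_id_card([0.85, 0.15, 0.05], toy_index, k=3)
print(card)
```

In a setup like this, the quality of the card depends entirely on what is already in the index, which is why the choice of evaluation data matters so much for the claims being made.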


Reviewers' Comments

The problem this paper addresses is real and important. The field is excited about moving towards function-based approaches to sequence screening, away from BLAST-based screening. The five-axis ID card framing is intuitive, and the ESM-2 plus FAISS pipeline is well-implemented for what it actually does. Some issues that stand out:

- The introduction frames this as a defense against AI-designed sequences that evade similarity-based screening. But the entire evaluation uses reviewed UniProt proteins, well-characterized, well-annotated sequences that are almost certainly similar to ESM-2's pretraining data. No de novo designed sequences are tested anywhere. The core threat model is stated but never evaluated, which means the biosecurity claim rests entirely on inference rather than evidence. Granted, evaluating on genuinely de novo designed sequences is extremely hard; wet lab validation is not a realistic ask, and even computationally generating good test cases is non-trivial. But this should be explicitly acknowledged as a core limitation rather than left implicit.

- The BLAST comparison overstates the contribution. BLAST is a sequence similarity tool — it was never designed to predict host category or cellular tropism. Beating BLAST at functional classification is not a meaningful benchmark. The right comparison is against purpose-built protein function classifiers, several of which exist in the literature; frameworks like PROBE explicitly benchmark ESM-2 embeddings on function prediction tasks and would have been the appropriate baseline. The aggregate F1 improvement of 0.89 vs 0.77 is presented as the headline result but it doesn't answer the question the paper asks.

- There's also a data leakage concern worth flagging. Labels were derived from UniProt metadata, and ESM-2 was pretrained on UniProt sequences. The model may be partially recovering annotations it was exposed to during pretraining rather than genuinely learning functional biology from sequence alone. This is unacknowledged.

- The tropism result warrants caution. After excluding 13,985 sequences as uninformative, only 1,169 samples remained across four categories, of which roughly 117 were test samples. At that scale, one or two misclassifications swing the F1 significantly. A result of 0.98 on approximately 25 examples per class is not robust enough to draw strong conclusions from.

- The zoonotic result is the most honest and interesting part of the paper. BLAST outperforms the embedding approach here (0.83 vs 0.80), and the explanation (that zoonotic potential is tied to conserved sequence signatures that local alignment captures better than semantic embeddings) is well-reasoned and adds genuine nuance.
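The small-sample concern raised in the tropism bullet above can be made concrete with a little arithmetic. The sketch below computes per-class F1 when a few test examples of one class are missed, assuming no false positives for that class and using the reviewer's approximate figure of 25 test examples per class; the comparison size of 2,500 and the helper names are purely illustrative.

```python
def f1(tp, fp, fn):
    """Per-class F1 from true positives, false positives, false negatives."""
    return 2 * tp / (2 * tp + fp + fn)

def f1_after_misses(n_positive, misses):
    """F1 for one class when `misses` of its `n_positive` test examples are
    predicted as another class and there are no false positives
    (a simplifying assumption for illustration)."""
    return f1(tp=n_positive - misses, fp=0, fn=misses)

# Compare a small class (about 25 test examples) with a large one.
for n in (25, 2500):
    scores = [round(f1_after_misses(n, m), 3) for m in range(4)]
    print(n, scores)
```

Under these assumptions, a single miss out of 25 already gives 48/49, or about 0.98, and each further miss costs roughly another two points, whereas the same handful of errors barely registers for a class with thousands of test examples.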

For this approach to actually catch novel dangerous proteins, you'd need training data that explicitly labels dangerous function, not just taxonomy and host category. That data doesn't exist publicly, and curating it would itself be an infohazard. The paper doesn't engage with this at all, which is the most important limitation it leaves unaddressed.

What's been built is a strong functional annotator for known viral protein space. That has real utility — it outperforms BLAST on classifying divergent but known sequences, which matters for surveillance of natural variation. That's a legitimate contribution, just a narrower one than claimed.

This was a really great problem to pick and the approach itself was reasonable, but the project fell short in a few areas for me:

1) The defining problem is how to catch de novo designed sequences with screening tools that can act on sequences alone. The work only tested the screening pipeline on known sequences, and did not test or discuss how well this might generalize to out-of-sample (i.e. truly de novo) sequences that could have the same function. My guess would be: it doesn’t.

2) The compelling approach to this problem (which was rightly identified by the author) was to try to use AI tools to predict functionality from sequence alone. However, the work only did this in a fairly narrow sense (specifically in the ‘Function’ prediction task) while the rest of the classification tasks were merely that - classifying various descriptors of the virus sequence, such as host and tropism. These are not really functions per se, and many aren’t really relevant for predicting whether a sequence is dangerous (e.g. Baltimore classification). Also, many of these classification tasks seemed kinda simple (as evidenced by BLAST pretty much getting it right across the board). This data would be a grind to gather for 1000s of virus sequences, but more directly relevant data for predicting how dangerous a sequence is would be e.g. human cell infection, viral titre, entry assay data, fusion assay data, genetic stability, mutation rate, glycan usage, immune evasion, host protein binding sites etc.

3) Conclusions were overstated (0.89 vs 0.77 for simple classification tasks doesn't seem like a ‘significant advancement’ to me), and there were some pretty generic/recycled explanations of how high-dimensional embeddings have magically captured ‘billions of years of evolution’. Several assertions were off the mark in a biological sense - as a practising virologist it was news to me that there are special ‘Zoonosis motifs’ that can predict zoonotic potential. If only!

4) There was little to no discussion of limitations or caveats. I would have expected to see some kind of model validation metrics or other things to make sure your models are not overfitting to the prediction tasks, for example. Would have been really great to mention the limitation that this pipeline would not necessarily work for true de novo sequences.

Sorry if I sounded mean - I think overall you were on the right track! It was a great problem to pick, I think the methodology was sound in principle and you presented the results clearly and efficiently.
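One inexpensive way to probe the overfitting and out-of-sample concerns in points 1 and 4 is a cluster-aware train/test split, where whole groups of similar sequences are held out together. The sketch below is a minimal illustration under assumptions: the cluster IDs are hypothetical (they might come from clustering sequences at some identity threshold), and `grouped_split` is an illustrative helper, not part of the submission. Holding out clusters only approximates dissimilar known sequences; it does not stand in for truly de novo designs.

```python
import random
from collections import defaultdict

def grouped_split(cluster_of, test_fraction=0.2, seed=0):
    """Split sample IDs so that every cluster lands wholly in train or test.

    `cluster_of` maps sample ID -> cluster ID (hypothetical input, e.g. from
    clustering sequences at some identity threshold).
    """
    clusters = defaultdict(list)
    for sample, cluster in cluster_of.items():
        clusters[cluster].append(sample)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    target = test_fraction * len(cluster_of)
    train, test = [], []
    for cid in cluster_ids:
        # Fill the test set cluster by cluster until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(clusters[cid])
    return train, test

# Toy data: 30 samples in 10 clusters of 3.
toy_clusters = {f"seq{i}": f"c{i // 3}" for i in range(30)}
train_ids, test_ids = grouped_split(toy_clusters, test_fraction=0.2)
print(len(train_ids), len(test_ids))
```

Comparing scores from a split like this against a purely random split gives a rough sense of how much of the reported performance depends on near-duplicates of test sequences being present in the index.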

This submission is well structured, clearly written, and uses visualization where appropriate, all of which conveys the ideas and findings very well. I appreciate the novel approach to synthesis screening, clearly accounting for de novo AI-enabled design. This is the direction synthesis screening will have to take in the future. However, as acknowledged in the paper, a discussion of what to do with the ID card is lacking. This will be crucial to turning this submission's approach into an effective screening tool. For example, follow-up work may use expert surveys to determine which sequences should be flagged, especially how to trade off between different risk factors.

Cite this work

@misc{
  title={(HckPrj) The Protein ID Card: A Semantically Aware Screening Framework for Pathogenic Sequence Detection Using ESM-2 and FAISS},
  author={Guillaume Zahnd},
  date={4/24/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
