Apr 24, 2026
The Protein ID Card: A Semantically Aware Screening Framework for Pathogenic Sequence Detection Using ESM-2 and FAISS
Guillaume Zahnd
The emergence of AI-powered biological design tools has necessitated a shift in biosecurity from sequence-alignment methods to function-prediction-based analysis. Current DNA screening protocols relying on BLAST are increasingly vulnerable to de novo designed sequences that evade similarity thresholds while retaining pathogenic functionality. We propose a scalable biosecurity screening pipeline that utilizes the ESM-2 transformer architecture to extract deep biological features from viral sequences, followed by similarity retrieval using FAISS. We implement a multi-class classification scheme to generate a functional "ID card" for proteins across five axes: Baltimore classification, Molecular function, Host category, Cellular tropism, and Zoonotic potential. Evaluating 15,154 viral samples from UniProt, our approach achieves a superior aggregate F1-score of 0.89 compared to 0.77 for BLAST. The embedding-based pipeline demonstrates significant performance gains in complex domains such as Host category (0.94) and Cellular Tropism (0.98), where sequence identity often fails to reflect biological roles. These results indicate that high-dimensional embeddings successfully capture the structural and functional constraints of viral evolution, providing a robust, semantically aware guardrail for modern biosecurity.
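The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes each protein has already been mapped to a fixed-length embedding. The 3-dimensional toy vectors, the `REFERENCE` set, the host labels, and the k-nearest-neighbor majority vote are all illustrative assumptions, not the submission's actual data or voting rule. A brute-force cosine search stands in for the FAISS index, and no ESM-2 model is loaded.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_label(query, index, k=3):
    """Majority vote over the k reference embeddings most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reference set: (toy embedding, host category label).
REFERENCE = [
    ((0.9, 0.1, 0.0), "vertebrate"),
    ((0.8, 0.2, 0.1), "vertebrate"),
    ((0.1, 0.9, 0.0), "plant"),
    ((0.0, 0.8, 0.2), "plant"),
    ((0.1, 0.1, 0.9), "bacteria"),
]

if __name__ == "__main__":
    # A query close to the two "vertebrate" reference vectors.
    print(predict_label((0.85, 0.15, 0.05), REFERENCE))
```

Repeating the same lookup once per axis, each with its own label set, would yield one entry of the "ID card" per axis under these assumptions.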
The problem this paper addresses is real and important. The field is excited about moving towards function-based approaches to sequence screening, away from BLAST-based screening. The five-axis ID card framing is intuitive, and the ESM-2 plus FAISS pipeline is well-implemented for what it actually does. Some issues that stand out:
- The introduction frames this as a defense against AI-designed sequences that evade similarity-based screening. But the entire evaluation uses reviewed UniProt proteins, well-characterized, well-annotated sequences that are almost certainly similar to ESM-2's pretraining data. No de novo designed sequences are tested anywhere. The core threat model is stated but never evaluated, which means the biosecurity claim rests entirely on inference rather than evidence. Granted, evaluating on genuinely de novo designed sequences is extremely hard; wet lab validation is not a realistic ask, and even computationally generating good test cases is non-trivial. But this should be explicitly acknowledged as a core limitation rather than left implicit.
- The BLAST comparison overstates the contribution. BLAST is a sequence similarity tool — it was never designed to predict host category or cellular tropism. Beating BLAST at functional classification is not a meaningful benchmark. The right comparison is against purpose-built protein function classifiers, several of which exist in the literature; frameworks like PROBE explicitly benchmark ESM-2 embeddings on function prediction tasks and would have been the appropriate baseline. The aggregate F1 improvement of 0.89 vs 0.77 is presented as the headline result but it doesn't answer the question the paper asks.
- There's also a data leakage concern worth flagging. Labels were derived from UniProt metadata, and ESM-2 was pretrained on UniProt sequences. The model may be partially recovering annotations it was exposed to during pretraining rather than genuinely learning functional biology from sequence alone. This is unacknowledged.
- The tropism result warrants caution. After excluding 13,985 sequences as uninformative, only 1,169 samples remained across four categories, and the test set amounts to roughly 117 of them (about 10%). At that scale, one or two misclassifications swing the F1 significantly. A result of 0.98 on approximately 25 examples per class is not robust enough to draw strong conclusions from.
- The zoonotic result is the most honest and interesting part of the paper. BLAST outperforms the embedding approach here (0.83 vs 0.80), and the explanation, that zoonotic potential is tied to conserved sequence signatures that local alignment captures better than semantic embeddings, is well-reasoned and adds genuine nuance.
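To make the sample-size concern about the tropism result concrete, the sketch below computes per-class F1 when a small number of test examples are missed. The class sizes (10, 25, 29) are illustrative assumptions, since per-class test counts are not reported here, and `f1_after_misses` is a hypothetical helper that assumes the only errors are false negatives.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_after_misses(n_class, n_missed):
    """Per-class F1 when n_missed of n_class test examples are missed.

    Assumption: no false positives, so every error is a false negative.
    """
    return f1_score(tp=n_class - n_missed, fp=0, fn=n_missed)


if __name__ == "__main__":
    # Illustrative class sizes; not taken from the submission.
    for n_class in (10, 25, 29):
        scores = [f1_after_misses(n_class, m) for m in range(4)]
        print(n_class, " ".join(f"{s:.3f}" for s in scores))
```

With 25 examples in a class, a single miss already lowers F1 from 1.00 to about 0.98, and smaller classes move further per error.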
For this approach to actually catch novel dangerous proteins, you'd need training data that explicitly labels dangerous function, not just taxonomy and host category. That data doesn't exist publicly, and curating it would itself be an infohazard. The paper doesn't engage with this at all, which is the most important limitation it leaves unaddressed.
What's been built is a strong functional annotator for known viral protein space. That has real utility — it outperforms BLAST on classifying divergent but known sequences, which matters for surveillance of natural variation. That's a legitimate contribution, just a narrower one than claimed.
This was a really great problem to pick and the approach itself was reasonable, but the project fell short in a couple of areas for me:
1) The defining problem is how do we catch de novo designed sequences with screening tools that can act on sequences alone. The work only tested the screening pipeline on known sequences, and did not test or discuss how well this might generalize to out-of-sample (i.e. truly de novo) sequences that could have the same function. My guess would be: it doesn’t.
2) The compelling approach to this problem (which was rightly identified by the author), was to try and use AI tools to predict functionality from sequence alone. However, the work only did this in a fairly narrow sense (specifically in the ‘Function’ prediction task) while the rest of the classification tasks were merely that - classifying various descriptors of the virus sequence, such as host and tropism. These are not really functions per se, and many aren’t really relevant for predicting whether a sequence is dangerous (e.g. Baltimore classification). Also, many of these classification tasks seemed kinda simple (as evidenced by BLAST pretty much getting it right across the board). This data would be a grind to gather for 1000s of virus sequences, but more directly relevant data to predicting how dangerous a sequence is would be e.g. human cell infection, viral titre, entry assay data, fusion assay data, genetic stability, mutation rate, glycan usage, immune evasion, host protein binding sites etc.
3) Conclusions were overstated (0.89 vs 0.77 for simple classification tasks doesn't seem like a ‘significant advancement’ to me), and there were some pretty generic/recycled explanations of how high-dimensional embeddings have magically captured ‘billions of years of evolution’. Several assertions were off the mark in a biological sense - as a practising virologist it was news to me that there are special ‘Zoonosis motifs’ that can predict zoonotic potential. If only!
4) There was little to no discussion of limitations or caveats. I would have expected to see some kind of model validation metrics or other things to make sure your models are not overfitting to the prediction tasks, for example. Would have been really great to mention the limitation that this pipeline would not necessarily work for true de novo sequences.
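One computational way to probe the generalization and overfitting concerns above would be a similarity-aware holdout, where whole groups of related sequences are kept out of training. The following is only a sketch under stated assumptions: k-mer Jaccard similarity is a crude stand-in for a proper sequence-identity clustering tool, the threshold of 0.4 is arbitrary, and the sequences in `SEQS` are made-up toy strings, not real proteins.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity(seqs, threshold=0.4, k=3):
    """Single-linkage clusters: pairs with Jaccard >= threshold are merged."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    profiles = [kmers(s, k) for s in seqs]
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if jaccard(profiles[i], profiles[j]) >= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_by_cluster(seqs, test_fraction=0.3, threshold=0.4):
    """Move whole clusters (smallest first) into the test set until it holds
    at least test_fraction of the data; the result may overshoot the target."""
    clusters = sorted(cluster_by_similarity(seqs, threshold), key=len)
    target = test_fraction * len(seqs)
    train, test = [], []
    for members in clusters:
        (test if len(test) < target else train).extend(members)
    return sorted(train), sorted(test)


# Hypothetical toy sequences: two small "families" and one outlier.
SEQS = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR",
        "GSHMLEDPVW", "GSHMLEDPVF", "WWPQNNCCDE"]

if __name__ == "__main__":
    train_idx, test_idx = split_by_cluster(SEQS)
    print("train:", train_idx, "test:", test_idx)
```

Because no cluster straddles the split, a drop in F1 relative to a random split would hint at how much the reported scores depend on near-duplicates of training sequences. This still falls short of testing truly de novo designs.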
Sorry if I sounded mean - I think overall you were on the right track! It was a great problem to pick, I think the methodology was sound in principle and you presented the results clearly and efficiently.
This submission is well structured, clearly written, and uses visualization where appropriate, all of which conveys the ideas and findings very well. I appreciate the novel approach to synthesis screening, clearly accounting for de novo AI-enabled design. This is the direction synthesis screening will have to take in the future. However, as acknowledged in the paper, a discussion of what to do with the ID card is lacking. This will be crucial to turning this submission's approach into an effective screening tool. For example, follow-up work may use expert surveys to determine which sequences should be flagged, especially how to trade off between different risk factors.
Cite this work
@misc {
  title={(HckPrj) The Protein ID Card: A Semantically Aware Screening Framework for Pathogenic Sequence Detection Using ESM-2 and FAISS},
  author={Guillaume Zahnd},
  date={4/24/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


