Apr 27, 2026
BioScreen: Function Aware Biological Sequence Screening with Mechanism of Harm Classification
Subramanyam Sahoo
BioScreen is a function-aware biological sequence screening system designed to address a key biosecurity weakness in sequence-similarity-based DNA synthesis screening: adversarially engineered proteins can preserve harmful function while drifting far from known threats in sequence space. The project fine-tunes ESM-2 3B with a multitask objective combining binary threat detection, mechanism-of-harm classification, supervised contrastive learning, and PGD adversarial training in embedding space, so that sequences cluster by biological effect rather than superficial resemblance. On a 2,913-sequence evaluation set, the production model reports AUROC 0.998 and AP 1.000, outperforming both a 3-mer BLAST proxy and pretrained ESM-2 similarity baselines, while also classifying seven mechanism-of-harm categories with 88.6% mean per-class accuracy. The system also includes certified robustness analysis via randomized smoothing and a deployment profile showing 38–45 sequences per second on a single NVIDIA H200 GPU, comfortably above commercial synthesis throughput needs. The main limitation is dependence on curated UniProt Swiss-Prot data, which may leave gaps for novel synthetic proteins or underrepresented threat classes; the author therefore positions BioScreen as a first-pass filter that should be paired with expert review for high-consequence cases.
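To make the weakness of similarity-based screening concrete, the following is a minimal sketch of a k-mer overlap score. It is an illustration only, not the submission's actual "3-mer BLAST proxy", whose details are not given here. The function names (`kmers`, `jaccard`, `similarity_score`) and the toy sequences are hypothetical.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_score(query: str, references: list[str], k: int = 3) -> float:
    """Highest k-mer Jaccard similarity between the query and any reference."""
    q = kmers(query, k)
    return max((jaccard(q, kmers(r, k)) for r in references), default=0.0)


# Toy data (assumption): one made-up "known threat" sequence.
reference = ["MKTAYIAKQR"]

# An exact copy of the reference scores 1.0.
exact = similarity_score("MKTAYIAKQR", reference)

# A sequence sharing no 3-mers scores 0.0, even if it hypothetically
# retained the same function; this is the gap a function-aware model targets.
drifted = similarity_score("MRSGWLVEQN", reference)
```

A screen that thresholds on such a score flags only sequences that resemble known threats, which is why the abstract contrasts it with embeddings that cluster by biological effect.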
The author identifies and works to tackle a significant biosecurity challenge in screening novel agents, and using AI to accelerate the identification of harmful sequences seems useful. The technical details are beyond my expertise to assess, and a simplified explanation would help a lay audience. That said, the contribution seems geared towards technical audiences, so this is a nice-to-have rather than a major problem.
You clearly stated the problem space and framed the work within it well, but a few things stood out that were unclear to me or unspecified, most importantly around the use of "function."
- In the statement "organises the embedding space so that functionally similar sequences cluster together regardless of sequence identity," how were functionally similar sequences validated as such? Were these related wild-type proteins from UniProt, or were these the generated variants?
- If the adversarial variants were generated by ESM-2 and then also validated/evaluated against the model, wouldn't that bias the results?
- How (and why) were the labels for those 7 expanded threat categories selected and defined?
As a potential limitation or future consideration: if sequence screening should happen at multiple points of the design-build-test cycle, not just at synthesis, or wherever embedded solutions are needed, where do you see a solution like this falling in terms of performance, price, and, more broadly, accessibility?
Cite this work
@misc {
title={(HckPrj) BioScreen: Function Aware Biological Sequence Screening with Mechanism of Harm Classification},
author={Subramanyam Sahoo},
date={4/27/26},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


