Apr 27, 2026
USING EMBEDDINGS AS A PROXY FOR FUNCTIONALITY IN DNA SCREENING
Kailer Laino
A screening tool that uses LLM embeddings to determine whether a protein sequence is functionally similar to a reference sequence. It performs much better than BLAST, which suggests that the embedding space of models trained on proteins is closely related to the space in which proteins are functionally similar.
The project correctly identifies that screening based on sequence homology has large blind spots, and that embedding-based screening could address some of them. I think it's a useful proof of concept.
I thought there were three main things that don't quite work. First, the variants were generated by conservative amino acid substitutions within biochemically similar groups. ESM-2 was trained on natural sequences, which contain exactly this kind of substitution because that is what happens during evolution. So ESM-2 giving high cosine similarity to these variants is largely a consequence of the experimental design, not evidence that it detects function. You have shown that ESM-2 detects biochemical similarity, which is almost guaranteed given how the variants were made, so I don't think this is a great proxy for function. It would be cool to see how generating variants with ProteinMPNN affects ESM-2-based detection.
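To make the first point concrete, here is a minimal sketch of how such variants might be produced and compared. The residue groups, the function names, and the substitution rate are assumptions for illustration; the submission's actual grouping and pipeline are not described in this review, and the embedding step (ESM-2) is left out.

```python
import math
import random

# Assumed grouping of amino acids by side-chain chemistry (illustrative only;
# not necessarily the groups used in the submission).
GROUPS = [
    "AVLIM",  # aliphatic / hydrophobic
    "FWY",    # aromatic
    "ST",     # small polar (hydroxyl)
    "NQ",     # amides
    "DE",     # acidic
    "KRH",    # basic
]
GROUP_OF = {aa: group for group in GROUPS for aa in group}


def conservative_variant(seq, rate, rng):
    """Swap each residue, with probability `rate`, for a different residue
    from the same group. Residues outside every group (C, G, P) are kept."""
    out = []
    for aa in seq:
        group = GROUP_OF.get(aa)
        if group is not None and rng.random() < rate:
            out.append(rng.choice([x for x in group if x != aa]))
        else:
            out.append(aa)
    return "".join(out)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

Every substitution here stays inside one chemical class, so a model trained on evolutionary data would be expected to embed the variant close to the original regardless of whether function is preserved.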
Second, you don't report specificity as a metric. Sensitivity alone doesn't mean much for a detector, since one that flags every sequence achieves perfect sensitivity.
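As a rough sketch of what reporting both metrics could look like, assuming each sequence has a similarity score and a ground-truth hazard label, and that a sequence is flagged when its score meets a threshold (the function name, scores, labels, and threshold below are all made up for illustration):

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) for a detector that flags a
    sequence when its score is >= threshold. `labels` are True for
    hazardous sequences and False for benign ones."""
    tp = fn = tn = fp = 0
    for score, is_hazard in zip(scores, labels):
        flagged = score >= threshold
        if is_hazard:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical scores: three hazardous variants, then two benign sequences.
scores = [0.95, 0.91, 0.60, 0.93, 0.40]
labels = [True, True, True, False, False]
print(sensitivity_specificity(scores, labels, threshold=0.9))
print(sensitivity_specificity(scores, labels, threshold=0.0))
```

With these toy numbers, lowering the threshold to 0.0 pushes sensitivity to 1.0 while specificity drops to 0.0, which is why a set of benign controls is needed alongside the hazardous variants.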
Third, your baseline is a bit too weak; a more realistic one would have been to run the sequences through commec.
Using protein language models to generate measures of similarity between sequences seems like a natural and sensible approach to me (though I am not an expert in DNA synthesis screening). I can imagine it fitting into a wider set of algorithms run as part of the screening pipeline, where it would improve accuracy. I found the write-up quite clear, and appreciated the effort put into validation and empirical results.
Cite this work
@misc {
  title={(HckPrj) USING EMBEDDINGS AS A PROXY FOR FUNCTIONALITY IN DNA SCREENING},
  author={Kailer Laino},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


