Apr 27, 2026
Probing Harm-Related Signals in Pretrained Protein Language Models
Edric Castel C. Hao, Katrina Compendio, Rowell Herrera, Ciaran O'Connell, Alessandro Lucatelli
Protein safety classifiers, used to flag toxic, virulent, or otherwise hazardous sequences, are increasingly built on top of pretrained protein language models (PLMs), yet little is known about how these models represent harm-related properties or what this implies for their reliability. We probe ESM-2 (esm2_t6_8M_UR50D) on three binary classification tasks of biosecurity relevance: peptide toxicity, pore-forming toxin (PFT) identity, and virulence. Using mass-mean and logistic-regression linear probes applied to CLS-token activations at every transformer layer, we report three findings. First, the pretrained backbone, which has never seen harm-related labels, already encodes these tasks in a form that is largely linearly recoverable, with probe accuracies typically reaching ~75–90% in intermediate and deeper layers. Second, fine-tuning yields its largest gains on the mass-mean probe rather than the logistic-regression probe, suggesting that it primarily improves alignment of existing task-relevant structure with class-mean directions, rather than substantially increasing linear separability. Third, zero-shot cross-task evaluation reveals partial but non-trivial transfer among the three tasks, consistent with shared underlying structure, with virulence-trained models generalizing most broadly and PFT-trained models producing an inversely correlated signal on general toxicity. These results suggest that current PLM-based safety classifiers may rely heavily on pre-existing, linearly accessible representations, potentially limiting robustness to distribution shift or adversarially constructed sequences. While linear probes demonstrate that harm-related information is present in model representations, they do not establish that deployed classifiers causally depend on the same features. Taken together, our findings highlight both the promise and limitations of PLM-based safety screening and motivate further work on robustness and failure modes.
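To make the two probe types named in the abstract concrete, here is a minimal sketch rather than the authors' implementation. It assumes synthetic Gaussian vectors as a stand-in for per-sequence CLS-token activations at a single layer. The function names (`make_synthetic_activations`, `fit_mass_mean`, `fit_logreg`), the midpoint threshold for the mass-mean probe, and the gradient-descent settings are all illustrative assumptions, not details from the submission.

```python
import math
import random


def make_synthetic_activations(n_per_class, dim, shift, seed=0):
    """Hypothetical stand-in for CLS activations: two Gaussian classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            # Class means sit at -shift/2 and +shift/2 on every coordinate.
            X.append([rng.gauss(shift * (label - 0.5), 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def class_mean(X, y, label):
    rows = [x for x, yi in zip(X, y) if yi == label]
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_mass_mean(X, y):
    """Mass-mean probe: direction is the difference of the class means.

    The bias places the decision boundary at the midpoint between the two
    projected class means (an assumption made for this sketch).
    """
    mu0, mu1 = class_mean(X, y, 0), class_mean(X, y, 1)
    w = [a - b for a, b in zip(mu1, mu0)]
    b = -0.5 * (dot(w, mu0) + dot(w, mu1))
    return w, b


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, lr=0.1, epochs=100):
    """Logistic-regression probe via batch gradient descent, unregularized."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, yi in zip(X, y):
            err = sigmoid(dot(w, x) + b) - yi
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    preds = [1 if dot(w, x) + b > 0 else 0 for x in X]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)


# Separate seeds give disjoint train and test draws from the same distribution.
X_train, y_train = make_synthetic_activations(200, 16, shift=1.0, seed=0)
X_test, y_test = make_synthetic_activations(200, 16, shift=1.0, seed=1)

mm_acc = accuracy(*fit_mass_mean(X_train, y_train), X_test, y_test)
lr_acc = accuracy(*fit_logreg(X_train, y_train), X_test, y_test)
print(f"mass-mean: {mm_acc:.3f}  logistic regression: {lr_acc:.3f}")
```

In a real replication, the synthetic vectors would be replaced by activations extracted from each transformer layer, and both probes would be fit once per layer. Comparing the two accuracies layer by layer is what separates "information is linearly separable" (logistic regression) from "information is aligned with the class-mean direction" (mass-mean).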
Great work on mechanistic interpretability. How do representations drive model behavior?
It's a good diagnostic paper that asks a great question from a biosecurity perspective: what harm-related information does a protein language model already encode? The linear probing methodology is valid and appropriate. The finding that harm-related information is already embedded in the pretrained representation, and that fine-tuning mainly modifies its separability, is intriguing, especially given the cross-task analysis indicating that the same underlying structure is being modified. The work identifies its own limitations, which makes it more credible.
What can improve: the novelty of the work is questionable, since linear probing has been done before and the work does not extend the methodology. It is not clear how this research leads to concrete changes or impact. Specifically, while robustness issues are discussed, no example of a potential failure point, such as an adversarial attack on the classifier, is provided. Further testing could demonstrate the robustness problem in more detail. It is also not clear how the findings apply to existing protein classifier systems, and an experiment would be welcome here. Lastly, the paper lacks specific recommendations on how safety pipelines could be improved using these results.
As someone with only a glancing knowledge of computational biology, I feel underqualified to speak to the exact methodological design choices made by the authors. That being said, I believe that it is important to gain a better understanding of how PLMs "understand" elements of harm in order to both assess and improve their robustness against evasion attempts. This paper provides valuable indications with respect to the representations of harm in ESM-2 and sets the stage for further research in this area.
Cite this work
@misc{
  title={(HckPrj) Probing Harm-Related Signals in Pretrained Protein Language Models},
  author={Edric Castel C. Hao, Katrina Compendio, Rowell Herrera, Ciaran O'Connell, Alessandro Lucatelli},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


