Apr 27, 2026

Probing Harm-Related Signals in Pretrained Protein Language Models

Edric Castel C. Hao, Katrina Compendio, Rowell Herrera, Ciaran O'Connell, Alessandro Lucatelli

Protein safety classifiers, used to flag toxic, virulent, or otherwise hazardous sequences, are increasingly built on top of pretrained protein language models (PLMs), yet little is known about how these models represent harm-related properties or what this implies for their reliability. We probe ESM-2 (esm2_t6_8M_UR50D) on three binary classification tasks of biosecurity relevance: peptide toxicity, pore-forming toxin (PFT) identity, and virulence. Using mass-mean and logistic-regression linear probes applied to CLS-token activations at every transformer layer, we report three findings. First, the pretrained backbone, which has never seen harm-related labels, already encodes these tasks in a form that is largely linearly recoverable, with probe accuracies typically reaching ~75–90% in intermediate and deeper layers. Second, fine-tuning yields its largest gains on the mass-mean probe rather than the logistic-regression probe, suggesting that it primarily improves alignment of existing task-relevant structure with class-mean directions, rather than substantially increasing linear separability. Third, zero-shot cross-task evaluation reveals partial but non-trivial transfer among the three tasks, consistent with shared underlying structure, with virulence-trained models generalizing most broadly and PFT-trained models producing an inversely correlated signal on general toxicity. These results suggest that current PLM-based safety classifiers may rely heavily on pre-existing, linearly accessible representations, potentially limiting robustness to distribution shift or adversarially constructed sequences. While linear probes demonstrate that harm-related information is present in model representations, they do not establish that deployed classifiers causally depend on the same features. Taken together, our findings highlight both the promise and limitations of PLM-based safety screening and motivate further work on robustness and failure modes.
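To make the two probe types named in the abstract concrete, here is a minimal sketch of a mass-mean probe and a logistic-regression probe fit on the activations of a single layer. It rests on several assumptions that are not from the submission: the activations are synthetic Gaussian vectors standing in for ESM-2 CLS-token embeddings, all function names are hypothetical, and the logistic probe uses plain gradient descent rather than whatever solver the authors used. In practice the same two fits would be repeated on the activations from each transformer layer.

```python
import numpy as np


def fit_mass_mean_probe(X, y):
    """Mass-mean probe: direction = difference of the two class means."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)
    direction = mu_pos - mu_neg
    # Decision threshold: midpoint of the class means, projected on the direction.
    threshold = 0.5 * (mu_pos + mu_neg) @ direction
    return direction, threshold


def predict_mass_mean(X, direction, threshold):
    return (X @ direction > threshold).astype(int)


def fit_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic-regression probe via batch gradient descent (illustrative solver)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def predict_logistic(X, w, b):
    return (X @ w + b > 0).astype(int)


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


def make_synthetic_layer(n=400, d=32, shift=1.5, seed=0):
    """ASSUMPTION: synthetic stand-in for one layer of CLS-token activations.

    The class label shifts only coordinate 0, so the "true" harm direction
    is the first basis vector.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, 0] += shift * (2 * y - 1)
    return X, y


X, y = make_synthetic_layer()
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

direction, threshold = fit_mass_mean_probe(X_train, y_train)
mm_acc = accuracy(y_test, predict_mass_mean(X_test, direction, threshold))

w, b = fit_logistic_probe(X_train, y_train)
lr_acc = accuracy(y_test, predict_logistic(X_test, w, b))

print(f"mass-mean probe accuracy: {mm_acc:.2f}")
print(f"logistic probe accuracy:  {lr_acc:.2f}")
```

The mass-mean probe has no free parameters beyond the class means, so it only scores well when the class-mean difference already points along a separating direction. The logistic probe can pick any separating hyperplane. Comparing the two on the same activations is one way to read the abstract's second finding: a gain that shows up mainly in the mass-mean probe suggests better alignment with class-mean directions rather than more linear separability.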

Reviewer's Comments

Great work on mechanistic interpretability. How do representations drive model behavior?

It's a good diagnostic paper that asks a great question from a biosecurity perspective: what information has a protein language model already “learned” about the harmful properties of sequences? The linear probing methodology used here is valid and appropriate. The finding that harm-related information is already embedded in the pretrained representation, and that fine-tuning simply modifies separability, is intriguing, especially given the additional cross-task analysis indicating that the same structure is being modified. The work identifies its own limitations, which makes it more credible.

What can improve: the novelty of the work is questionable, in that linear probing has been done before and the work does not extend the methodology. It's not clear how this research leads to concrete changes or impact. Specifically, while robustness issues are discussed, no examples of potential failure points, such as adversarial attacks on the classifier, are provided. Further testing could demonstrate the robustness problem in more detail. It's also not quite clear how this applies to existing protein classifier systems, and an experiment would be nice here. Lastly, the paper lacks specific recommendations on how safety pipelines could be improved with these results.

As someone with only a glancing knowledge of computational biology, I feel underqualified to speak to the exact methodological design choices made by the authors. That being said, I believe that it is important to gain a better understanding of how PLMs "understand" elements of harm in order to both assess and improve their robustness against evasion attempts. This paper provides valuable indications with respect to the representations of harm in ESM-2 and sets the stage for further research in this area.

Cite this work

@misc{
  title={(HckPrj) Probing Harm-Related Signals in Pretrained Protein Language Models},
  author={Edric Castel C. Hao, Katrina Compendio, Rowell Herrera, Ciaran O'Connell, Alessandro Lucatelli},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Apr 27, 2026