Apr 27, 2026
Toxin Circuits in ESM-2: Mechanistic Interpretability Reveals Why Structure-Aware Probes Resist ProteinMPNN Redesign
Manan Wadhwa, Shivam Dubey
Background & Problem
Standard biosecurity screening relies on sequence identity (BLAST).
ProteinMPNN redesigns toxin sequences so that they fall below every BLAST threshold: across 723 redesigns, BLAST detection drops to 0%.
Proposed Solution & Mechanism
A linear probe on frozen ESM-2 representations still detects 93.9% of the redesigns, with no retraining.
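As a rough illustration of what a linear probe on frozen embeddings involves, the sketch below fits a logistic-regression probe on fixed vectors and then scores unseen samples without retraining. It is not the authors' code: the synthetic Gaussian vectors merely stand in for pooled ESM-2 embeddings, and the function names (train_linear_probe, detect), the dimensionality, and the hyperparameters are assumptions made for this example.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit logistic-regression weights on fixed embeddings X (n, d), labels y in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of class 1
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # gradient step on weights (with L2 penalty)
        b -= lr * np.mean(p - y)                 # gradient step on bias
    return w, b

def detect(X, w, b):
    """Flag every row whose probe logit is positive."""
    return (X @ w + b) > 0.0

# Synthetic stand-ins for embeddings (assumption: NOT real ESM-2 output).
rng = np.random.default_rng(0)
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)            # hypothetical "toxin" direction
negatives = rng.normal(size=(200, d))
positives = rng.normal(size=(200, d)) + 4.0 * direction

X = np.vstack([negatives, positives])
y = np.concatenate([np.zeros(200), np.ones(200)])
w, b = train_linear_probe(X, y)

# New samples that keep the same shift are scored with the frozen probe.
held_out = rng.normal(size=(100, d)) + 4.0 * direction
detection_rate = detect(held_out, w, b).mean()
print(f"held-out detection rate: {detection_rate:.2f}")
```

The point of the toy setup is only that the probe keeps working as long as the new samples preserve the direction it reads out, which mirrors the explanation given for the redesigns.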
InterPLM sparse autoencoders (SAEs) identify 50 features, a 205× compression, that explain the probe's performance.
Redesign amplifies these features rather than suppressing them (mean transfer ratio 1.28), because ProteinMPNN preserves structural fold topology, which is precisely what the circuit encodes.
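The summary does not define the transfer ratio. One plausible reading, assumed here purely for illustration, is the per-feature ratio of mean SAE activation on redesigned sequences to mean activation on the original toxins, averaged over the selected features; values above 1 then indicate amplification. The function name and the toy activation values below are hypothetical.

```python
import numpy as np

def transfer_ratios(acts_original, acts_redesign, eps=1e-8):
    """Per-feature ratio of mean activation on redesigns to mean activation on originals.

    Both inputs have shape (n_sequences, n_features). This definition is an
    assumption, not taken from the source.
    """
    mean_original = acts_original.mean(axis=0)
    mean_redesign = acts_redesign.mean(axis=0)
    return mean_redesign / (mean_original + eps)   # eps guards against division by zero

# Toy activations for three hypothetical features.
acts_original = np.array([[1.0, 2.0, 0.5],
                          [1.0, 2.0, 0.5]])
acts_redesign = np.array([[1.5, 2.0, 0.5],
                          [1.5, 3.0, 0.5]])

ratios = transfer_ratios(acts_original, acts_redesign)
mean_ratio = ratios.mean()
print(ratios, mean_ratio)
```

Under this reading, a mean ratio above 1 means the selected features fire more strongly on redesigns than on the sequences they were derived from.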
Security Analysis & Evaluation
A four-tier attack taxonomy places the security boundary at gradient access: ProteinMPNN redesign evades the probe 6.1% of the time, whereas white-box attacks with gradient access evade it 100% of the time.
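The 6.1% evasion figure is the complement of the 93.9% detection rate reported above. The short check below works out the sequence counts those percentages would imply, assuming both refer to the same set of 723 redesigns; the counts themselves are derived here and are not stated in the source.

```python
n_redesigns = 723
detection_rate = 0.939

# Implied counts, assuming the rate applies to all 723 redesigns.
n_detected = round(n_redesigns * detection_rate)   # 679
n_evaded = n_redesigns - n_detected                # 44

evasion_pct = round(100 * n_evaded / n_redesigns, 1)
print(n_detected, n_evaded, evasion_pct)           # 679 44 6.1
```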
Direct Probe Attribution identifies layer 32 as the bottleneck (correlation between the redesign and toxin circuits, r = 0.992).
SAE-based probes recover 38% of the “Double-Evaders”, sequences that fool both BLAST and dense linear probes, demonstrating direction-sensitive detection beyond Euclidean boundaries.
Discoveries & Conclusion
Zero-shot scanning of UniRef50 surfaces 248 candidates that are enriched 4.75× for secretion signal peptides and include cross-kingdom fungal effectors; 54% are currently annotated as “Uncharacterized” in UniProt.
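A fold enrichment such as 4.75× is conventionally the fraction of hits among the candidates divided by the fraction in a background set. The sketch below shows that calculation; the source does not give the underlying counts or the background used, so the numbers passed to fold_enrichment are toy values chosen only to reproduce the ratio. The final line derives the approximate number of “Uncharacterized” candidates implied by 54% of 248.

```python
def fold_enrichment(hits_in_candidates, n_candidates, hits_in_background, n_background):
    """Ratio of the hit fraction among candidates to the hit fraction in the background."""
    candidate_fraction = hits_in_candidates / n_candidates
    background_fraction = hits_in_background / n_background
    return candidate_fraction / background_fraction

# Toy counts (hypothetical, not from the source): 47.5% vs. 10% gives roughly 4.75x.
enrichment = fold_enrichment(95, 200, 100, 1000)
print(f"{enrichment:.2f}x")

# Implied count of "Uncharacterized" entries among the 248 reported candidates.
n_uncharacterized = round(0.54 * 248)   # 134
print(n_uncharacterized)
```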
The probe is therefore only as secure as its weights are private.
Very interesting results and good progress for a hackathon weekend! I think people already suspected that pLMs could be quite helpful for that, but nice to see these numbers. I mostly wonder how much that changes with shorter sequences, though. This seems to be the crux, at least for me.
An overall very strong effort. The comparison to BLAST screening is compelling and well elucidated, the demonstration of utility of a simple linear probe on a frozen model is motivating, and the variety of experiments probing the nature of this detector are mostly compelling. However, the work would be improved via further consideration of what it means for the detector to be vulnerable to a "white box" gradient attack, and the writeup suffers from some internal inconsistencies (e.g. caption vs content in figure 1, different assertions about the number of double-evaders).
Cite this work
@misc{
  title={(HckPrj) Toxin Circuits in ESM-2: Mechanistic Interpretability Reveals Why Structure-Aware Probes Resist ProteinMPNN Redesign},
  author={Manan Wadhwa and Shivam Dubey},
  date={2026-04-27},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


