May 5, 2026
SAEBER: Sparse Autoencoders for Biological Entity Risk
Michael Yu
Open-weight protein design models might generate toxic or virulent proteins. Current classifiers are accurate but not interpretable or explainable. In this work, we train Sparse Autoencoders (SAEs) on RFD3 and RF3, leading open-source protein design and folding models. We find SAE features with meaningful correlation to toxicity and virulence, with the top classifier reaching 0.87 AUROC.
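As a rough illustration of what "SAE features with meaningful correlation to toxicity" can mean in practice, the sketch below encodes activations with a standard ReLU SAE encoder and scores each resulting feature by its single-feature AUROC against binary labels. It is a minimal sketch and not the submission's actual pipeline: the function names, the assumption of one pooled activation vector per protein, and the encoder parameters `W_enc` and `b_enc` are all hypothetical.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc):
    """ReLU encoder of a generic sparse autoencoder: f = relu(x @ W_enc + b_enc).

    x is assumed to hold one pooled activation vector per protein (hypothetical).
    """
    return np.maximum(x @ W_enc + b_enc, 0.0)


def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic, using average ranks for ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = int(labels.sum())
    n_neg = int((~labels).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        # Every member of the run gets the average of its 1-based ranks.
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1

    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def rank_features(features, labels):
    """Return feature indices sorted by distance of their AUROC from chance (0.5)."""
    aurocs = np.array([auroc(features[:, k], labels)
                       for k in range(features.shape[1])])
    order = np.argsort(-np.abs(aurocs - 0.5))
    return order, aurocs
```

Ranking by distance from 0.5 treats features that fire on benign proteins as informative too; a real analysis would also need held-out evaluation before trusting any single feature.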
You clearly know how every element of this project works and how it can be improved, which is really great to see. If I were you, I'd get in contact with researchers working on mechanistic interpretability for biological models; this could be a great stepping stone to doing this research full-time if you're thinking about going into AIxBio (I see you're a MATS Fellow, so I assume you're focused on more general AI safety).
The text was refreshing to read, I'd love to see what happens if you or someone else implements your Future Directions.
This project provides a strong proof-of-concept for applying mechanistic interpretability techniques to building better safeguards against protein design models. The progress made is very impressive given the limited time and resources. I would like to congratulate the author for this piece of work, well done!
The potential to better understand virulence predictions is particularly valuable for tracking high-risk use cases of protein design models, including identifying emerging trends and revealing novel threat models.
A richer description of the results or more explanation on the website would benefit the presentation, especially for readers less familiar with the topic (just have an LLM do it!). This is completely understandable given the constraints of working as a one-man team within a hackathon timeframe. Showing an example success case, such as a successful identification of a virulent motif, could be a powerful demonstration.
A really strong submission. Your novelty might be a bit overstated in places, as this type of work has been performed and is ongoing, but you do highlight that and directly identify that the novelty lies in applying it to RFD3/RF3. I also appreciate that your limitations related to time and compute resources were stated transparently and accurately, and this still represents an impressive amount of work done technically well for a hackathon.
The experiments that were performed were performed well with proper controls. This is good, rigorous ML research.
Notably, the finding that RFD3 memorizes family folds (i.e., block 6 → near random under clustering) is genuinely interesting and biosecurity-relevant. It potentially implies that the model's safety properties depend on whether the input distribution overlaps with training families. That said, n=44 per fold under clustering is fragile, though again this makes sense given the time and compute constraints. This is nice groundwork to be followed up on in future studies; it is a tantalizing finding that really does deserve a more in-depth exploration.
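For readers less familiar with the clustered evaluation discussed in this comment, the idea is that all proteins from the same family cluster land in the same fold, so a probe cannot score well simply by recognizing a family it saw during training. The sketch below is a generic greedy fold assignment under that constraint, not the author's code; `cluster_kfold` is a hypothetical name, and `cluster_ids` is assumed to be precomputed by some sequence or structure clustering step that the submission would define.

```python
from collections import defaultdict


def cluster_kfold(cluster_ids, n_folds=5):
    """Assign sample indices to folds so that no cluster is split across folds.

    cluster_ids[i] is the (precomputed, assumed) cluster label of sample i.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest clusters first, each into the
    # currently smallest fold. Ties are broken by cluster label for determinism.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[cid])
    return folds


folds = cluster_kfold(["a", "a", "b", "c", "c", "c"], n_folds=2)
```

With few clusters the folds can end up small and uneven, which is one reason a per-fold sample size such as n=44 gives noisy estimates.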
The block 12 RFD3 cluster-split finding is interesting, and the polysemanticity-untangling interpretation is plausible, but it falls into the same category: a nice hypothesis or foundation worth following up on.
Layer selection is admittedly ad hoc. You flagged this and proposed the knockout-pLDDT alternative for future work; I think that is a great idea and the correct move.
The headline 0.817 vs SOTA 0.92 gap is larger than the text suggests. I would say this is getting close but 'within striking distance' may be a bit strong.
Cite this work
@misc{
  title={(HckPrj) SAEBER: Sparse Autoencoders for Biological Entity Risk},
  author={Michael Yu},
  date={5/5/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


