Mar 10, 2025
Searching for Universality and Equivariance in LLMs using Sparse Autoencoder Found Features
Meruyert Alimaganbetova, Jason Zeng
The project investigates how neuron features with properties of universality and equivariance affect the controllability and safety of large language models, finding that behaviors supported by redundant features are more resistant to manipulation than those governed by singular features.
I like this research direction a lot and think there's signal to explore re: redundant feature sets bolstering safety-relevant behaviors vs. singular features remaining more vulnerable. The analysis would have been more robust if the team had generalized across a larger dataset of prompt examples and had more clearly defined, in words and as a formula, what counts as "equivariance" and as "singular"/"redundant" features (e.g., based on existing literature).
Cite this work
@misc{alimaganbetova2025universality,
  title={Searching for Universality and Equivariance in LLMs using Sparse Autoencoder Found Features},
  author={Meruyert Alimaganbetova and Jason Zeng},
  date={2025-03-10},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


