Jul 27, 2025
A Mechanism for the Emergence of Superposition in Toy Models
Wu Shuin Jian, Choong Kai Zhe, Harshank Shrotriya
We conjecture that superposition, rather than being a problem, is a key feature of neural networks that allows them to efficiently model high-dimensional data with limited resources. We explore how Nonlinear Matrix Decomposition might provide insight into how superposition works and could lead to the creation of tools for interpretability.
Dmitry Vaintrob
The project observes that the target of toy models of superposition is equivalent to the target of nonlinear matrix decomposition. This simple but nice observation relates the two data properties of superposition and positive factorization (in general distinct and independently observed in ML data). I think this project would be significantly improved by an experiment: as the two approaches use very different learning algorithms, it would be quite interesting to compare their performance, e.g. the loss of Saul's algorithm in the TMS context.
Note about the abstract: superposition is not considered a pathological behavior, but is specifically posited as a way that networks can compress sparse vectors. The connection between Saul's matrix factorization work and toy models of superposition is not (at least not in a clear way) indicative of either being a compression phenomenon, and the paper does not seem to motivate this.
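One way to set up the comparison experiment suggested above might look like the following sketch (illustrative only, not the authors' or reviewers' code). It assumes the standard TMS setup x̂ = ReLU(WᵀWx + b) with n = 6 features and m = 3 hidden units, and uses a simple fill-and-truncate alternating scheme in the spirit of Saul's iterative NMD algorithm; the exact variant used in the report may differ.

import torch

torch.manual_seed(0)
n, m, N, p_active = 6, 3, 4096, 0.1

# Sparse nonnegative data: each feature is active with probability p_active,
# with value drawn uniformly from [0, 1].
X = torch.rand(n, N) * (torch.rand(n, N) < p_active).float()

# --- TMS: fit x_hat = ReLU(W^T W x + b) by gradient descent ---
W = torch.nn.Parameter(0.1 * torch.randn(m, n))
b = torch.nn.Parameter(torch.zeros(n, 1))
opt = torch.optim.Adam([W, b], lr=1e-2)
for step in range(3000):
    opt.zero_grad()
    X_hat = torch.relu(W.T @ (W @ X) + b)
    loss = ((X - X_hat) ** 2).mean()
    loss.backward()
    opt.step()
with torch.no_grad():
    tms_mse = ((X - torch.relu(W.T @ (W @ X) + b)) ** 2).mean().item()

# --- Naive NMD: find a rank-m Theta minimising ||X - ReLU(Theta)||_F ---
# Alternate between (i) filling a latent matrix Z with X on the support of X
# and with min(Theta, 0) on the zeros, and (ii) taking the best rank-m
# approximation of Z via truncated SVD.
Theta = 0.1 * torch.randn(n, N)
support = X > 0
for it in range(200):
    Z = torch.where(support, X, torch.clamp(Theta, max=0.0))
    U, S, Vh = torch.linalg.svd(Z, full_matrices=False)
    Theta = U[:, :m] @ torch.diag(S[:m]) @ Vh[:m, :]
nmd_mse = ((X - torch.relu(Theta)) ** 2).mean().item()

print(f"TMS reconstruction MSE: {tms_mse:.5f}")
print(f"NMD reconstruction MSE: {nmd_mse:.5f}")

Logging the TMS loss over training in the same script would also produce the step-structured loss curve discussed below.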
Logan Riggs Smith
"This report argues that superposition is not a pathological behavior, but a
structural consequence of efficient representation under resource constraints."
I don't believe this is a controversial or novel claim. TMS has already made this claim & your paper seems to agree with that interpretation. To fix this (potential) contradiction, you could clarify what you mean by pathological & how TMS makes a different claim than above.
I did like the general connection of NMD w/ TMS as well as the step-loss function; however, it would've been good to include the step-loss function as a figure.
I am unsure of the usefulness of understanding how TMS is learned. TMS is reconstructing known features, whereas NNs are learning features and where to store them in a compressed way in order to predict the label. It's unclear whether learning how TMS learns its features will generalize to the real case.
Ari Brill
This entry proposes that superposition in neural networks can be described through the lens of nonlinear matrix decomposition (NMD), and finds empirical evidence that toy models of superposition and NMD produce very similar sets of overlapping features. This seems like a theoretical connection that could potentially be of interest for the mechanistic interpretability community. It would be great for this work to be extended further, and to see if NMD could inspire new interpretability methods that are competitive with sparse autoencoders.
Jennifer Lin
The authors propose that nonlinear matrix decomposition (NMD), an iterative method to obtain a low-rank approximation of a sparse nonnegative matrix, may be a useful lens to understand the phenomenon of superposition in neural networks. As evidence they show that in a small instance of the “Toy Models of Superposition” experiment (with six sparse features and three hidden neurons), the basis vectors of the rank 3 matrix \Theta that one learns upon applying NMD to compress the 6-row data matrix to rank 3 resemble the feature directions found by the TMS experiment. The authors also speculatively propose that the stepwise loss function for TMS with (n=6, m=3) that appears during training (as discussed in section 5 of the TMS paper) could be explained by the model effectively discovering a higher-rank NMD representation of the dataset at each step.
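For concreteness, the correspondence between the two targets can be written out as follows (my paraphrase, assuming the standard TMS objective with uniform feature importances and writing X for the n × N data matrix):

\min_{W \in \mathbb{R}^{m \times n},\, b \in \mathbb{R}^{n}} \; \lVert X - \mathrm{ReLU}(W^{\top} W X + b \mathbf{1}^{\top}) \rVert_F^2 \qquad \text{(TMS, $n$ features, $m$ hidden units)}

\min_{\Theta \in \mathbb{R}^{n \times N},\ \mathrm{rank}(\Theta) \le m} \; \lVert X - \mathrm{ReLU}(\Theta) \rVert_F^2 \qquad \text{(rank-$m$ NMD)}

Since \mathrm{rank}(W^{\top} W X + b \mathbf{1}^{\top}) \le m + 1, the pre-activation that TMS learns is itself low rank, so the TMS reconstruction target is, up to the bias and the parametrization through W, an instance of the NMD target.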
This work is conceptually strong. The proposed connection between superposition and NMD is new to me and backed by a crisp supporting experiment. The connection to AI safety is indirect (implicit through superposition being an open problem for mechanistic interpretability, which in turn is arguably critical for AI safety), but this chain of logic is usually taken for granted in the interpretability community. The initial success of the idea in a small toy model is promising. I would however caution that this model is simple enough that many different interpretability proposals can be made to work in it, and positive results in larger TMS models (at a minimum, which is hopefully not too hard to run with existing code) as well as models beyond TMS would help to make a stronger case for a 'smoking gun' signal to pursue the project beyond the weekend hackathon. It could also be interesting to try to analytically understand why NMD and TMS match in the model that the authors studied.
Max Hennick
While the idea is interesting and intuitively makes sense, the rigor is lacking. Although TMS and NMD are discussed as conceptually similar, the experiments provide only visual justification and no quantitative metrics. Including such metrics would massively improve the quality of the paper.
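One quantitative metric along these lines might compare the pairwise-overlap (Gram) structure of the TMS feature directions with that of the NMD basis directions; this sidesteps the fact that the NMD factorization is only defined up to an invertible m × m change of basis. A minimal sketch, with hypothetical names W_tms (the trained m × n TMS weight matrix) and A_nmd (the n × m left factor of the rank-m Θ):

import numpy as np

def gram(directions):
    """Rows are per-feature directions in the m-dim latent space; returns the
    n x n matrix of cosine similarities between features."""
    D = directions / (np.linalg.norm(directions, axis=1, keepdims=True) + 1e-12)
    return D @ D.T

def overlap_agreement(W_tms, A_nmd):
    """W_tms: (m, n) TMS weight matrix (columns = feature directions).
    A_nmd: (n, m) left factor of the rank-m NMD, i.e. Theta ~= A_nmd @ B.
    Returns the Frobenius distance and the Pearson correlation between the two
    feature-overlap matrices, computed on off-diagonal entries only."""
    G_tms, G_nmd = gram(W_tms.T), gram(A_nmd)
    off = ~np.eye(G_tms.shape[0], dtype=bool)
    frob = np.linalg.norm(G_tms - G_nmd)
    corr = np.corrcoef(G_tms[off], G_nmd[off])[0, 1]
    return frob, corr

Reporting these numbers alongside a random-direction baseline would make the visual similarity claim quantitative.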
Cite this work
@misc{
  title={(HckPrj) A Mechanism for the Emergence of Superposition in Toy Models},
  author={Wu Shuin Jian and Choong Kai Zhe and Harshank Shrotriya},
  date={2025-07-27},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}