A Mechanism for the Emergence of Superposition in Toy Models

Wu Shuin Jian, Choong Kai Zhe, Harshank Shrotriya

We conjecture that superposition, rather than being a problem, is a key feature of neural networks that allows them to efficiently model high-dimensional data with limited resources. We explore how nonlinear matrix decomposition (NMD) might provide insight into how superposition works and could lead to tools for interpretability.
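As background, nonlinear matrix decomposition in the sense used here (following Saul's work) seeks a low-rank matrix \Theta such that a sparse nonnegative data matrix X is well approximated by ReLU(\Theta). The snippet below is a minimal gradient-descent sketch of that objective, not the algorithm used in the report; the data dimensions, sparsity level, and learning rate are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

# Sparse nonnegative data matrix X (n features x samples), mirroring the TMS-style setup.
n, samples, rank = 6, 2000, 3
X = rng.uniform(size=(n, samples)) * (rng.uniform(size=(n, samples)) < 0.1)

# Low-rank parameterization Theta = W @ H, so that X is approximated by relu(Theta).
W = rng.normal(scale=0.1, size=(n, rank))
H = rng.normal(scale=0.1, size=(rank, samples))

lr = 0.05
for step in range(5000):
    Theta = W @ H
    G = (np.maximum(Theta, 0.0) - X) * (Theta > 0)   # grad of 0.5*||relu(Theta) - X||^2 w.r.t. Theta
    W -= lr * (G @ H.T) / samples
    H -= lr * (W.T @ G) / samples

err = np.linalg.norm(np.maximum(W @ H, 0.0) - X) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.3f}")

Saul's papers use more sophisticated update schemes; the point of the sketch is only the form of the objective, a low-rank \Theta whose ReLU reconstructs the sparse data.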

Reviewer's Comments

Dmitry Vaintrob

The project observes that the target of toy models of superposition is equivalent to the target of nonlinear matrix decomposition. This simple but nice observation relates the two data properties of superposition and positive factorization (in general distinct and independently observed in ML data). I think this project would be significantly improved by an experiment: as the two approaches use very different learning algorithms, it would be quite interesting to compare their performance, the loss of Saul's algorithm in the TMS context, etc.

Note about the abstract: superposition is not considered a pathological behavior, but is specifically posited as a way that networks can compress sparse vectors. The connection between Saul's matrix factorization work and toy models of superposition is not (at least not in a clear way) indicative of either being a compression phenomenon, and the paper does not seem to motivate this.
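Regarding the comparison suggested above (evaluating the loss of Saul's algorithm in the TMS context), a minimal sketch of such an evaluation might look like the following. It assumes the standard TMS setup (reconstruction relu(W^T W x + b) with importance-weighted squared error); the function and variable names are hypothetical, and mapping an NMD factorization onto a TMS weight matrix is left to the experimenter.

import numpy as np

def tms_loss(W, b, X, importance=None):
    # W: (m, n) map from n features to m hidden dims; b: (n,) output bias;
    # X: (n, batch) sparse feature vectors; importance: (n,) feature importances.
    if importance is None:
        importance = np.ones(X.shape[0])
    H = W @ X                                        # hidden activations
    X_hat = np.maximum(W.T @ H + b[:, None], 0.0)    # TMS reconstruction
    return float(np.mean(importance[:, None] * (X_hat - X) ** 2))

With a helper like this, the weights learned by gradient descent in TMS and weights read off from an NMD factorization of the same data could be reported side by side in objective value.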

Logan Riggs Smith

"This report argues that superposition is not a pathological behavior, but a

structural consequence of efficient representation under resource constraints."

I don't believe this is a controversial or novel claim. TMS has already made this claim & your paper seems to agree with that interpretation. To fix this (potential) contradiction, you could clarify what you mean by pathological & how TMS makes a different claim than above.

I did like the general connection of NMD w/ TMS as well as the step-loss function; however, it would've been good to include the step-loss function as a figure.

I am unsure about the usefulness of understanding how TMS is learned. TMS is reconstructing known features, whereas NNs are learning features and where to store them in a compressed way in order to predict the label. It's unclear if learning how TMS learns its features will generalize to the real case.

Ari Brill

This entry proposes that superposition in neural networks can be described through the lens of nonlinear matrix decomposition (NMD), and finds empirical evidence that toy models of superposition and NMD produce very similar sets of overlapping features. This seems like a theoretical connection that could potentially be of interest for the mechanistic interpretability community. It would be great for this work to be extended further, and to see if NMD could inspire new interpretability methods that are competitive with sparse autoencoders.

Jennifer Lin

The authors propose that nonlinear matrix decomposition (NMD), an iterative method to obtain a low-rank approximation of a sparse nonnegative matrix, may be a useful lens to understand the phenomenon of superposition in neural networks. As evidence they show that in a small instance of the “Toy Models of Superposition” experiment (with six sparse features and three hidden neurons), the basis vectors of the rank-3 matrix \Theta obtained by applying NMD to compress the 6-row data matrix resemble the feature directions found by the TMS experiment. The authors also speculatively propose that the stepwise loss function that appears during training of TMS with (n=6, m=3) (as discussed in section 5 of the TMS paper) could be explained by the model effectively discovering a higher-rank NMD representation of the dataset at each step.

This work is conceptually strong. The proposed connection between superposition and NMD is new to me and backed by a crisp supporting experiment. The connection to AI safety is indirect (implicit through superposition being an open problem for mechanistic interpretability, which in turn is arguably critical for AI safety), but this chain of logic is usually taken for granted in the interpretability community. The initial success of the idea in a small toy model is promising. I would however caution that this model is simple enough that many different interpretability proposals can be made to work in it, and positive results in larger TMS models (at a minimum, that’s hopefully not too hard to run with existing code) as well as models beyond TMS would help to make a stronger case for a ‘smoking gun’ signal to pursue the project beyond the weekend hackathon. It could also be interesting to try to analytically understand why NMD and TMS match in the model that the authors studied.
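For reference, a minimal PyTorch sketch of the TMS side of the (n=6, m=3) experiment described above might look as follows; the sparsity level, batch size, and optimizer settings are illustrative, and uniform feature importance is assumed.

import torch

torch.manual_seed(0)
n, m, p_off = 6, 3, 0.9          # 6 features, 3 hidden dims, 90% of feature entries zero

W = torch.nn.Parameter(0.1 * torch.randn(m, n))
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic features, as in Toy Models of Superposition (uniform importance).
    x = torch.rand(1024, n) * (torch.rand(1024, n) > p_off)
    x_hat = torch.relu(x @ W.T @ W + b)   # reconstruction through the 3-dimensional bottleneck
    loss = ((x_hat - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of W is a learned feature direction in the 3-dimensional hidden space;
# these are the vectors one would compare against the rank-3 NMD basis.
print(W.detach().T)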

Max Hennick

While the idea is interesting and intuitively makes sense, the rigor is lacking: although TMS and NMD are discussed as conceptually similar, the experiments only provide visual justification and no quantitative metrics. Including such metrics would massively improve the quality of the paper.
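One simple quantitative metric along these lines would be to compare the geometry of the two sets of feature directions directly, for example via their pairwise-cosine (Gram) matrices, which are invariant to rotations of the hidden basis. The helper below is a hypothetical illustration, not something from the paper.

import numpy as np

def gram_gap(A, B):
    # A, B: (d, k) arrays whose columns are k feature directions, e.g. the columns of the
    # TMS weight matrix W and the rank-d NMD feature embeddings for the same k features.
    # Comparing pairwise cosine similarities avoids having to align the two bases first.
    def cosines(M):
        M = M / np.linalg.norm(M, axis=0, keepdims=True)
        return M.T @ M
    return float(np.abs(cosines(A) - cosines(B)).mean())   # 0 means identical geometry

Reporting a number like this (or principal angles between the spanned subspaces) across several runs would turn the visual similarity into the kind of metric requested here.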

Cite this work

@misc{
  title={(HckPrj) A Mechanism for the Emergence of Superposition in Toy Models},
  author={Wu Shuin Jian and Choong Kai Zhe and Harshank Shrotriya},
  date={7/27/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Jul 28, 2025

Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization

In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC—indicating a phase transition—and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at \url{github.com/ilijalichkovski/apart-physics}.


Jul 28, 2025

AI agentic system epidemiology

As AI systems scale into decentralized, multi-agent deployments, emergent vulnerabilities challenge our ability to evaluate and manage systemic risks.

In this work, we adapt classical epidemiological modeling (specifically SEIR compartment models) to model adversarial behavior propagation in AI agents.

By solving systems of ODEs describing the systems with physics-informed neural networks (PINNs), we analyze stable and unstable equilibria, bifurcation points, and the effectiveness of interventions.

We estimate parameters from real-world data (e.g., adversarial success rates, detection latency, patching delays) and simulate attack propagation scenarios across 8 sectors (enterprise, retail, trading, development, customer service, academia, medical, and critical infrastructure AI tools).

Our results demonstrate how agent population dynamics interact with architectural and policy design interventions to stabilize the system.

This framework bridges concepts from dynamical systems and cybersecurity to offer a proactive, quantitative toolbox on AI safety.

We argue that epidemic-style monitoring and tools grounded in interpretable, physics-aligned dynamics can serve as early warning systems for cascading AI agentic failures.


Jul 28, 2025

Momentum–Point-Perplexity Mechanics in Large Language Models

This work analyzes the hidden states of twenty different open-source transformer language models, ranging from small to medium size and covering five major architectures. The key discovery is that these models show signs of "energy conservation" during inference—meaning a certain measure combining changes in hidden states and token unpredictability stays almost constant as the model processes text.

The authors developed a new framework inspired by physics to jointly analyze how hidden states and prediction confidence evolve over time. They propose that transformers' behavior can be understood as following certain mechanical principles, much like how physical systems follow rules like conservation of energy.

Their experiments show that this conserved quantity varies very little between tokens, especially in untrained (random-weight) models, where it's extremely stable. In pre-trained models, the average energy drops more due to training, but there are larger relative fluctuations from token to token.

They also introduce a new method, based on this framework, for controlling transformer outputs by "steering" the hidden states. This method achieves good results—producing completions rated as higher in semantic quality, while still maintaining the same kind of energy stability.

Overall, the findings suggest that viewing transformer models through the lens of physical mechanics gives new, principled ways to interpret and control their behavior. It also highlights a key difference: random models behave more like balanced systems, while trained models make quicker, more decisive state changes at the cost of less precise energy conservation.



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.