Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization

Jeremias Ferrao, Ilija Lichkovski


In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC—indicating a phase transition—and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at \url{github.com/ilijalichkovski/apart-physics}.
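For readers who want to reproduce the measurement: the LLC is typically estimated by sampling from a posterior localized around the current checkpoint with stochastic-gradient Langevin dynamics (SGLD). The sketch below is a minimal PyTorch illustration of that recipe, not the authors' implementation; model, loss_fn, data_loader, and all hyperparameters are placeholders, and the loss probed is the token-level cross-entropy rather than the GRPO objective.

import copy
import itertools
import torch

def estimate_llc(model, loss_fn, data_loader, n_samples,
                 eps=1e-4, gamma=100.0, n_chains=4, n_steps=300):
    """Minimal SGLD-based LLC estimator (illustrative sketch only).

    Returns n * beta * (E[L_n(w)] - L_n(w*)), with the expectation taken
    over SGLD samples from a posterior localized around the center w*.
    """
    # Inverse temperature from the free energy asymptotics: beta* = 1 / log n.
    beta = 1.0 / torch.log(torch.tensor(float(n_samples))).item()
    w_star = [p.detach().clone() for p in model.parameters()]

    # Baseline loss at the center w*.
    xb, yb = next(iter(data_loader))
    with torch.no_grad():
        init_loss = loss_fn(model(xb), yb).item()

    chain_means = []
    for _ in range(n_chains):
        m = copy.deepcopy(model)
        batches = itertools.cycle(data_loader)
        draws = []
        for _ in range(n_steps):
            xb, yb = next(batches)
            loss = loss_fn(m(xb), yb)
            m.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p, p0 in zip(m.parameters(), w_star):
                    # Drift: tempered log-likelihood gradient plus a quadratic
                    # localizer that keeps the chain near w*.
                    drift = beta * n_samples * p.grad + gamma * (p - p0)
                    p.add_(-0.5 * eps * drift)
                    p.add_(eps ** 0.5 * torch.randn_like(p))
            draws.append(loss.item())
        # Discard the first half of each chain as burn-in.
        draws = draws[n_steps // 2:]
        chain_means.append(sum(draws) / len(draws))

    expected_loss = sum(chain_means) / n_chains
    return n_samples * beta * (expected_loss - init_loss)

Estimates from this recipe are notoriously sensitive to eps, gamma, and the number of chains, which is exactly the calibration concern several reviewers raise below.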

Reviewer's Comments

Ari Brill

The idea of exploring developmental phase transitions measured using the LLC in a toy RL setting with a multi-component reward function is very interesting. The execution of the project is solid and the results are well-presented. It could be interesting for future work to explore the properties of training that give rise to clear LLC spikes (or not), as well as to supplement this method with complementary mechanistic analyses.

Nikita Khomich

Very exciting result, a clean repo which works, a clear physics analogy, and an exciting interpretability direction. The only bit I thought was suspicious is the matching between the LLC and the rewards; the standard-error-bar graph is a bit hand-wavy, and it would be nice to have something numerical confirming the correspondence is statistically significant. But of course, under time constraints it's very impressive work.
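One lightweight way to get the numerical confirmation asked for here, assuming the LLC trace and each reward component were logged at the same checkpoints (an assumption about the project's data, not something confirmed above), is a permutation test on their correlation:

import numpy as np

def permutation_corr_test(llc, reward, n_perm=10_000, seed=0):
    """Permutation test for the correlation between an LLC trace and a
    reward-component trace sampled at the same training checkpoints."""
    rng = np.random.default_rng(seed)
    llc, reward = np.asarray(llc, float), np.asarray(reward, float)
    observed = np.corrcoef(llc, reward)[0, 1]
    null = np.array([np.corrcoef(llc, rng.permutation(reward))[0, 1]
                     for _ in range(n_perm)])
    # Two-sided p-value with the standard +1 correction.
    p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
    return observed, p_value

Because training curves are autocorrelated, circular shifts or block permutations would give a more honest null than the full shuffle shown here.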

Lucas Teixeira

AI Safety Relevance: Trends in scaling RL for reasoning models make the rigorous study of inductive biases and developmental paths in RL systems a timely and, to my knowledge, under-invested topic.

Innovation & Originality: This was a pretty straightforward extrapolation of ideas in SLT to RL: measure the LLC in RL systems. Unfortunately

Technical Rigor & Feasibility: SLT as it currently stands applies to supervised learning in an i.i.d. setting. Whether it can be extended to RL, where the data distribution is coupled to the policy and optimization is performed over a non-stationary loss landscape, is an open theoretical problem that is not addressed in this work. Furthermore, the authors use the cross-entropy loss in their LLC estimation, and thus are not probing the GRPO loss landscape. It's possible that these landscapes are correlated in some way, which would lend more legitimacy to the results, but I haven't thought this through. It would also have been nice to have stronger evidence for the identification of the phase transitions. For example, part of what makes https://arxiv.org/pdf/2402.02364 compelling is that the stages of learning are, outside of a few hyperparameter choices, automatically identified (see Appendix B). Although visually compelling, it's unclear to me how exactly the inflection points in the LLC were chosen, and this is amplified by its very dramatic non-monotonic character. It would also have been nice to have some behavioural evaluations supporting the qualitative claims attached to the differences in reward (e.g., "this suggests a consolidation of skill... this may suggest a brief period of exploration"), but given the time constraints of the hackathon it is reasonable to have elided such a step.
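To spell out the mismatch raised here (our notation; the GRPO form is the standard one): the LLC estimator probes the geometry of the token-level cross-entropy,

    L_n(w) = -\frac{1}{n} \sum_{i=1}^{n} \log p_w(y_i \mid x_i),

whereas GRPO ascends a clipped surrogate with group-normalized advantages,

    \hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}, \qquad
    J(w) = \mathbb{E}\Big[\min\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\Big] - \beta\, D_{\mathrm{KL}}\big(p_w \,\|\, p_{\mathrm{ref}}\big), \qquad \rho_i = \frac{p_w(o_i \mid q)}{p_{w_{\mathrm{old}}}(o_i \mid q)},

so any correlation between the two landscapes is an empirical question rather than something SLT currently guarantees.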

Lauren

This is a clearly presented and cool project mixing SLT (phase-transition structure within the learning landscape) with RL. The scope was appropriate for a hackathon, and the authors are mostly up front about limitations and future work. While the results seem directionally good, the plots don't always show alignment between the reward and the LLC. However, the fact that these align with qualitative shifts in RL seems like a good indication that SLT is saying something about meaningful phase transitions during training, at least in this small testing ground. The safety relevance is mainly implicit (though strong), so a bit more about this could have been useful for framing. I'd be interested in follow-up work that shows the potential and limitations of the LLC in realistic RL settings.

Jesse Hoogland

RL is important, so I'm happy you're trying to bridge SLT with RL (it's something we're working on ourselves)! This is actually a quite crisp setting in which to try to make that shift.

I like the idea of looking at multipart RL objectives in our analysis, but I'm skeptical from what I see here that you've found an interpretation of these phase transitions. I'm not even sure these are phase transitions in the precise sense of a tradeoff satisfying the free energy formula. I'd like to see the overall reward!

I'm also pretty skeptical about the actual LLC curve. I'd like to see loss traces and possibly other diagnostics/hyperparameter sweeps before I can conclude this is sensible. Two chains feels like too few. But LLC hyperparameter calibration is hard enough even before trying to get it to work for RL, so don't lose hope!
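For reference, the free energy formula referred to above is Watanabe's asymptotic expansion of the Bayesian free energy, in which the LLC appears as the coefficient of the \log n term:

    F_n = -\log \int e^{-n L_n(w)}\, \varphi(w)\, dw \;\approx\; n L_n(w^*) + \lambda(w^*) \log n.

A phase transition in this precise sense is a crossover in which critical point dominates the posterior, i.e. when n L_n(w_1) + \lambda_1 \log n overtakes n L_n(w_2) + \lambda_2 \log n, trading lower loss against higher complexity.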

Cite this work

@misc{
  title={(HckPrj) Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization},
  author={Jeremias Ferrao, Ilija Lichkovski},
  date={2025-07-28},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Jul 28, 2025

AI agentic system epidemiology

As AI systems scale into decentralized, multi-agent deployments, emergent vulnerabilities challenge our ability to evaluate and manage systemic risks.

In this work, we adapt classical epidemiological modeling (specifically SEIR compartment models) to model adversarial behavior propagation in AI agents.

By solving systems of ODEs describing the systems with physics-informed neural networks (PINNs), we analyze stable and unstable equilibria, bifurcation points, and the effectiveness of interventions.

We estimate parameters from real-world data (e.g., adversarial success rates, detection latency, patching delays) and simulate attack propagation scenarios across 8 sectors (enterprise, retail, trading, development, customer service, academia, medical, and critical infrastructure AI tools).

Our results demonstrate how agent population dynamics interact with architectural and policy design interventions to stabilize the system.

This framework bridges concepts from dynamical systems and cybersecurity to offer a proactive, quantitative toolbox for AI safety.

We argue that epidemic-style monitoring and tools grounded in interpretable, physics-aligned dynamics can serve as early warning systems for cascading AI agentic failures.
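For concreteness, the SEIR backbone described above is a system of four coupled ODEs; the following sketch uses invented parameter values and a standard numerical solver rather than the authors' PINN setup:

import numpy as np
from scipy.integrate import solve_ivp

def seir_rhs(t, y, beta, sigma, gamma):
    """Classic SEIR compartments, reinterpreted for agent populations:
    'infection' is adversarial-behavior propagation, beta the transmission
    rate, 1/sigma the dormancy period, 1/gamma the detect-and-patch time."""
    s, e, i, r = y
    n = s + e + i + r
    new_infections = beta * s * i / n
    return [-new_infections,
            new_infections - sigma * e,
            sigma * e - gamma * i,
            gamma * i]

# Illustrative parameters only; the work estimates them from real-world data.
solution = solve_ivp(seir_rhs, (0.0, 160.0), [0.99, 0.0, 0.01, 0.0],
                     args=(0.4, 0.2, 0.1), dense_output=True)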

Read More

Jul 28, 2025

Momentum–Point-Perplexity Mechanics in Large Language Models

This work analyzes the hidden states of twenty different open-source transformer language models, ranging from small to medium size and covering five major architectures. The key discovery is that these models show signs of "energy conservation" during inference—meaning a certain measure combining changes in hidden states and token unpredictability stays almost constant as the model processes text.

The authors developed a new framework inspired by physics to jointly analyze how hidden states and prediction confidence evolve over time. They propose that transformers' behavior can be understood as following certain mechanical principles, much like how physical systems follow rules like conservation of energy.

Their experiments show that this conserved quantity varies very little between tokens, especially in untrained (random-weight) models, where it's extremely stable. In pre-trained models, the average energy drops more due to training, but there are larger relative fluctuations from token to token.

They also introduce a new method, based on this framework, for controlling transformer outputs by "steering" the hidden states. This method achieves good results—producing completions rated as higher in semantic quality, while still maintaining the same kind of energy stability.

Overall, the findings suggest that viewing transformer models through the lens of physical mechanics gives new, principled ways to interpret and control their behavior. It also highlights a key difference: random models behave more like balanced systems, while trained models make quicker, more decisive state changes at the cost of less precise energy conservation.
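The summary does not state the exact conserved quantity, so the following is purely our illustrative guess at a measure "combining changes in hidden states and token unpredictability", not the paper's definition:

import torch

def energy_trace(hidden_states, token_logprobs):
    """Hypothetical per-token 'energy': a kinetic-like term from hidden-state
    velocity plus a potential-like term from token surprisal. Illustrative
    only; the paper's actual quantity may differ substantially.

    hidden_states: (seq_len, d_model) final-layer states
    token_logprobs: (seq_len,) log-probabilities of the observed tokens
    """
    velocity = hidden_states[1:] - hidden_states[:-1]   # state change per token
    kinetic = 0.5 * velocity.pow(2).sum(dim=-1)         # ||dh||^2 / 2
    surprisal = -token_logprobs[1:]                     # token unpredictability
    return kinetic + surprisal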

Read More

Jul 27, 2025

Constrained Belief Updates and Geometric Structures in Transformer Representations for the RRXOR Process

In this work, we seek to further understand the computational structure formed by transformers trained to predict the next token of a data set. In particular, we extend the work of (Piotrowski et al., 2025) by analyzing if and how these transformers implement constrained Bayesian updating, focusing on data generated by a Random-Random-XOR (RRXOR) process. Unlike the MESS3 processes studied originally in (Piotrowski et al., 2025), the RRXOR process leads to slightly more nuanced theoretical and practical models. Nevertheless, we verify some of the main claims of (Piotrowski et al., 2025). Namely, we find attention patterns that decay exponentially with distance. We further analyze the behavior of OV vectors.
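For readers unfamiliar with the data-generating process: Random-Random-XOR emits two uniform random bits followed by their XOR, so every third symbol is deterministic given the preceding two. A minimal generator (our sketch):

import random

def rrxor_bits(n_triples, seed=0):
    """Generate an RRXOR sequence: each triple is two uniform random bits
    followed by their XOR."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_triples):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        out.extend([a, b, a ^ b])
    return out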

Read More


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.