Jul 28, 2025
Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization
Jeremias Ferrao, Ilija Lichkovski
In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC—indicating a phase transition—and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at github.com/ilijalichkovski/apart-physics.
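For readers who want to see what the LLC measurement involves, the sketch below shows one common way of estimating it at a single training checkpoint: SGLD sampling from a localized, tempered posterior around the current weights, following Lau et al. (2023). This is a generic, minimal illustration, not the authors' implementation; the function name estimate_llc, the assumption of a token-level cross-entropy loss interface, and every hyperparameter value are placeholders.

```python
# Minimal sketch (not the authors' code): SGLD-based LLC estimation at a
# checkpoint w*, lambda_hat = n * beta * (E_posterior[L_n(w)] - L_n(w*)).
from itertools import cycle
import copy
import math
import torch

def estimate_llc(model, loader, loss_fn, n_data, eps=1e-4, gamma=100.0,
                 n_chains=4, n_steps=300, n_burnin=100, device="cpu"):
    model = model.to(device)
    beta = 1.0 / math.log(n_data)                       # inverse temperature ~ 1 / log n
    w_star = [p.detach().clone() for p in model.parameters()]
    data = cycle(loader)

    def minibatch_loss(m):
        # Placeholder interface: model(x) -> predictions consumed by loss_fn.
        x, y = next(data)
        return loss_fn(m(x.to(device)), y.to(device))

    with torch.no_grad():
        init_loss = minibatch_loss(model).item()        # crude estimate of L_n(w*)

    chain_means = []
    for _ in range(n_chains):
        sampler = copy.deepcopy(model)
        draws = []
        for step in range(n_steps):
            loss = minibatch_loss(sampler)
            sampler.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p, p0 in zip(sampler.parameters(), w_star):
                    if p.grad is None:
                        continue
                    # SGLD step with localization term gamma * (w - w*)
                    drift = beta * n_data * p.grad + gamma * (p - p0)
                    p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
            if step >= n_burnin:
                draws.append(loss.item())
        chain_means.append(sum(draws) / len(draws))

    posterior_loss = sum(chain_means) / len(chain_means)
    return n_data * beta * (posterior_loss - init_loss)
```

Running an estimator of this kind at a series of GRPO checkpoints yields the LLC trajectory whose spikes the abstract interprets as phase transitions.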
Ari Brill
The idea of exploring developmental phase transitions measured using the LLC in a toy RL setting with a multi-component reward function is very interesting. The execution of the project is solid and the results are well-presented. It could be interesting for future work to explore the properties of training that give rise to clear LLC spikes (or not), as well as to supplement this method by performing complementary mechanistic analyses.
Nikita Khomich
Very exciting result, a clean repo that works, a clear physics analogy, and an exciting interpretability direction. The only bit I thought was suspicious is that the matching between the LLC and the reward curves with standard error bars is a bit hand-wavy; it would be nice to have something numerical to confirm it is statistically significant, but of course under time constraints it is very impressive work.
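One cheap way to get the "something numerical" asked for here would be a permutation test on the correlation between per-checkpoint changes in the LLC and in a reward component. The sketch below is a naive illustration (shuffling discards the autocorrelation structure of training curves, and all names are placeholders), not a claim about what the authors should report.

```python
# Hedged sketch: permutation test on the correlation between per-checkpoint
# changes in the LLC trace and in one reward component.
import numpy as np

def permutation_pvalue(llc, reward, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    d_llc, d_reward = np.diff(llc), np.diff(reward)     # per-checkpoint changes
    observed = np.corrcoef(d_llc, d_reward)[0, 1]
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(d_reward)
        if abs(np.corrcoef(d_llc, shuffled)[0, 1]) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)                   # two-sided p-value via |r|
```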
Lucas Teixeira
AI Safety Relevance: Trends in scaling RL for reasoning models make the rigorous study of inductive biases and developmental paths in RL systems a timely and, to my knowledge, under-invested topic.
Innovation & Originality: This was a pretty straightforward extrapolation of ideas in SLT to RL: measure the LLC in RL systems. Unfortunately
Technical Rigor & Feasibility: SLT as it currently stands is applicable to supervised learning in an i.i.d. setting. Whether it can be extended to RL, where the data distribution is coupled to the policy and optimization is performed over a non-stationary loss landscape, is an open theoretical problem that is not addressed in this work. Furthermore, the authors use cross-entropy loss in their LLC estimation, and thus are not probing the GRPO loss landscape. It's possible that these landscapes are correlated in some way, which would lend more legitimacy to the results, but I haven't thought this through. It would also have been nice to have stronger evidence for the identification of the phase transitions. For example, part of what makes https://arxiv.org/pdf/2402.02364 compelling is that the stages of learning are, outside of a few hyperparameter choices, automatically identified (see Appendix B). Although visually compelling, it's unclear to me how exactly the inflection points in the LLC were chosen, and this is amplified by the curve's very dramatic non-monotonic character. It would also have been nice to have some behavioural evaluations supporting the qualitative claims attached to the differences in reward (e.g. "this suggests a consolidation of skill...", "this may suggest a brief period of exploration") as further evidence, but given the time constraints of the hackathon it is reasonable to have elided this step.
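To make the comparison concrete: the paper cited above identifies stage boundaries from plateaus in the LLC curve rather than by eye. A minimal sketch of that style of criterion, with entirely placeholder smoothing-window and slope-threshold values and no claim that it matches either paper's procedure, might look like:

```python
# Hedged illustration of automatic stage identification: smooth the LLC trace
# and mark checkpoints where it enters or leaves a near-plateau region.
import numpy as np

def stage_boundaries(llc_trace, window=5, slope_threshold=0.05):
    """Return checkpoint indices where the smoothed LLC slope crosses the threshold."""
    x = np.asarray(llc_trace, dtype=float)
    kernel = np.ones(window) / window
    smooth = np.convolve(x, kernel, mode="same")        # moving-average smoothing
    slope = np.gradient(smooth)
    flat = np.abs(slope) < slope_threshold              # near-plateau regions
    return np.where(np.diff(flat.astype(int)) != 0)[0] + 1
```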
Lauren
This is a clearly presented and cool project mixing SLT (phase-transition structure within the learning landscape) with RL. The scope was appropriate for a hackathon, and the authors are mostly up front about limitations and future work. While the results seem directionally good, the plots don't always show alignment between the reward and the LLC. However, the fact that these align with qualitative shifts in RL seems like a good indication that SLT is saying something about meaningful phase transitions during training, at least in this small testing ground. The safety relevance is mainly implicit (though strong), so a bit more about this could have been useful for framing. I'd be interested in follow-up work that shows the potential and limitations of the LLC in realistic RL settings.
Jesse Hoogland
RL is important, so I'm happy you're trying to bridge SLT with RL (it's something we're working on ourselves)! This is actually a quite crisp setting in which to try to make that shift.
I like the idea of looking at multipart RL objectives in our analysis, but I'm skeptical from what I see here that you've found an interpretation of these phase transitions. I'm not even sure these are phase transitions in the precise sense of a tradeoff satisfying the free energy formula. I'd like to see the overall reward!
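For context, the free energy formula referred to here is the SLT asymptotic expansion of the (local) Bayesian free energy, in which the learning coefficient controls the log n term:

$$F_n \approx n\,L_n(w^*) + \lambda(w^*)\log n,$$

so calling a spike a phase transition in this precise sense would mean showing the posterior trading an increase in one term for a decrease in the other as training progresses.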
I'm also pretty skeptical about the actual LLC curve. I'd like to see loss traces and possibly other diagnostics/hyperparameter sweeps before I can conclude this is sensible. Two chains feels like too few. But LLC hyperparameter calibration is hard enough even before trying to get it to work for RL, so don't lose hope!
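On the "two chains feels like too few" point, one standard sanity check is a Gelman-Rubin style comparison of per-chain loss draws. The sketch below is a generic implementation of that statistic (not something from the project's repo); values well above ~1.0 would indicate the chains disagree and the LLC estimate should not be trusted.

```python
# Hedged sketch: Gelman-Rubin R-hat over post-burn-in SGLD loss draws.
import numpy as np

def r_hat(chains):
    """chains: array of shape (n_chains, n_draws) of post-burn-in loss values."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()          # W: mean within-chain variance
    between = n * chain_means.var(ddof=1)               # B: between-chain variance
    var_hat = (n - 1) / n * within + between / n        # pooled variance estimate
    return np.sqrt(var_hat / within)
```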
Cite this work
@misc{ferrao2025llc,
  title={(HckPrj) Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization},
  author={Jeremias Ferrao and Ilija Lichkovski},
  date={2025-07-28},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}