Jul 28, 2025

Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization

Jeremias Ferrao, Ilija Lichkovski


In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC—indicating a phase transition—and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at github.com/ilijalichkovski/apart-physics.
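For readers unfamiliar with the measurement, the following is a minimal, self-contained sketch of the kind of SGLD-based LLC estimator used in developmental interpretability, in the spirit of Lau et al. (2023). It is an illustrative reconstruction rather than the authors' code: the loss_fn(model, batch) interface, the hyperparameter defaults, and the single-batch baseline are all assumptions.

import copy
import math
import torch

def estimate_llc(model, loss_fn, batches, n_samples,
                 n_chains=4, n_steps=200, eps=1e-5, gamma=100.0, beta=None):
    """Estimate lambda_hat = n * beta * (E_SGLD[L_n(w)] - L_n(w*)),
    averaging over SGLD chains localized around the current weights w*."""
    if beta is None:
        beta = 1.0 / math.log(n_samples)           # inverse temperature ~ 1/log n
    w_star = [p.detach().clone() for p in model.parameters()]
    base_loss = loss_fn(model, batches[0]).item()  # proxy for L_n(w*) on one batch

    chain_means = []
    for _ in range(n_chains):
        sampler = copy.deepcopy(model)             # each chain starts at w*
        kept = []
        for step in range(n_steps):
            loss = loss_fn(sampler, batches[step % len(batches)])
            sampler.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p, w0 in zip(sampler.parameters(), w_star):
                    # Localized SGLD: gradient drift + restraint toward w* + noise.
                    drift = n_samples * beta * p.grad + gamma * (p - w0)
                    p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
            if step >= n_steps // 2:               # discard first half as burn-in
                kept.append(loss.item())
        chain_means.append(sum(kept) / len(kept))

    return n_samples * beta * (sum(chain_means) / n_chains - base_loss)

A spike in this estimate along the training trajectory is read as a jump in effective model complexity, which is what the paper correlates with shifts in the reward components.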

Reviewer's Comments


The idea of exploring developmental phase transitions measured using the LLC in a toy RL setting with a multi-component reward function is very interesting. The execution of the project is solid and the results are well-presented. It could be interesting for future work to explore the properties of training that give rise to clear LLC spikes (or not), as well as to supplement this method with complementary mechanistic analyses.

Very exciting result, a clean repo which works, and a clear physics analogy as well as an exciting interpretability direction. The only bit I thought was suspicious is that the matching between the LLC and the rewards-with-standard-error-bars graph is a bit hand-wavy; it would be nice to have something numerical to confirm it's statistically significant. But of course, under time constraints, it's very impressive work.
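One inexpensive way to obtain the numerical confirmation this reviewer asks for would be a permutation test on the co-movement of the two series. The sketch below is hypothetical and not part of the submission; llc and reward are assumed to be equal-length per-checkpoint arrays, and a block permutation would be more conservative if the increments are autocorrelated.

import numpy as np

def alignment_pvalue(llc, reward, n_perm=10_000, seed=0):
    """P-value for the null that LLC changes and reward changes are unaligned."""
    rng = np.random.default_rng(seed)
    d_llc, d_rew = np.diff(llc), np.diff(reward)
    observed = abs(np.corrcoef(d_llc, d_rew)[0, 1])
    # Null distribution: shuffle one series' increments to break temporal alignment.
    null = np.array([abs(np.corrcoef(rng.permutation(d_llc), d_rew)[0, 1])
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)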

AI Safety Relevance: Trends in scaling RL for reasoning models make the rigorous study of inductive biases and developmental paths in RL systems a timely and, to my knowledge, under-invested topic.

Innovation & Originality: This was a pretty straightforward extrapolation of ideas in SLT to RL: measure the LLC in RL systems. Unfortunately

Technical Rigor & Feasibility: SLT as it currently stands is applicable to supervised learning in an i.i.d. setting. Whether it can be extended to RL, where the data distribution is coupled to the policy and optimization is performed over a non-stationary loss landscape, is an open theoretical problem which is not addressed in this work. Furthermore, the authors use cross-entropy loss in their LLC estimation, and thus are not probing the GRPO loss landscape. It's possible that these landscapes could be correlated in some way, thus lending more legitimacy to the results, but I haven't thought through this. It would've also been nice to have stronger evidence for the identification of the phase transitions. For example, part of what makes https://arxiv.org/pdf/2402.02364 compelling is that the stages of learning are, outside of a few hyperparameter choices, automatically identified (see Appendix B). Although visually compelling, it's unclear to me how exactly the inflection points in the LLC were chosen. This is amplified by its very dramatic non-monotonic character. It would also have been nice to have some behavioural evaluations which supported the qualitative claims attached to the differences in reward (i.e. "this suggests a consolidation of skill... this may suggest a brief period of exploration") as further evidence, but given the time constraints of the hackathon it is reasonable to have elided such a step.
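For concreteness, the two objectives this reviewer contrasts can be written out (standard notation, not taken from the submission). The LLC was estimated against the cross-entropy loss

\[
L_{\mathrm{CE}}(w) = -\frac{1}{n}\sum_{i=1}^{n} \log p_w(y_i \mid x_i),
\]

while GRPO (Shao et al., 2024) optimizes, for each prompt \(q\) with a group of \(G\) sampled completions \(o_1,\dots,o_G\) and rewards \(r_1,\dots,r_G\),

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}, \qquad
J(w) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\big)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_w \,\Vert\, \pi_{\mathrm{ref}}\right),
\]

with importance ratio \(\rho_i = \pi_w(o_i \mid q)/\pi_{w_{\text{old}}}(o_i \mid q)\). The reviewer's point is that the estimator probes the geometry of the first landscape while training descends the second.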

This is a clearly presented and cool project mixing SLT (phase transition structure within the learning landscape) with RL. The scope was appropriate for a hackathon, and the authors are mostly up front about limitations and future work. While the results seem directionally good, the plots don't always show alignment between the reward and the LLC. However, the fact that these align with qualitative shifts in RL seems like a good indication that SLT is saying something about meaningful phase transitions during training, at least in this small testing ground. The safety relevance is mainly implicit (though strong), so a bit more about this could have been useful for framing. I'd be interested in follow-up work that shows the potential and limitations of the LLC in realistic RL settings.

RL is important, so I'm happy you're trying to bridge SLT with RL (it's something we're working on ourselves)! This is actually a quite crisp setting in which to try to make that shift.

I like the idea of looking at multipart RL objectives in our analysis, but I'm skeptical from what I see here that you've found an interpretation of these phase transitions. I'm not even sure these are phase transitions in the precise sense of a tradeoff satisfying the free energy formula. I'd like to see the overall reward!
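For reference, the free energy formula alluded to here is Watanabe's asymptotic expansion of the Bayesian free energy, as usually stated in SLT:

\[
F_n = -\log \int e^{-n L_n(w)}\,\varphi(w)\, dw \;\approx\; n L_n(w_0) + \lambda \log n,
\]

where \(\lambda\) is the learning coefficient. A phase transition in this precise sense is a jump of the posterior between regions of parameter space that trade off the accuracy term \(n L_n(w_0)\) against the complexity term \(\lambda \log n\).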

I'm also pretty skeptical about the actual LLC curve. I'd like to see loss traces and possibly other diagnostics/hyperparameter sweeps before I can conclude this is sensible. Two chains feels like too few. But LLC hyperparameter calibration is hard enough even before trying to get it to work for RL, so don't lose hope!
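The calibration this reviewer recommends could look like the following sketch: re-run the estimator over a grid of SGLD hyperparameters with more chains, and keep only settings where per-chain estimates agree. Here run_llc(eps, gamma) is an assumed hook returning one chain's LLC estimate, and the grids are placeholders.

import itertools
import numpy as np

def calibration_sweep(run_llc, eps_grid=(1e-6, 1e-5, 1e-4),
                      gamma_grid=(10.0, 100.0, 1000.0), n_chains=8):
    """Map each (eps, gamma) setting to the mean and spread of chain estimates."""
    results = {}
    for eps, gamma in itertools.product(eps_grid, gamma_grid):
        chains = np.asarray([run_llc(eps, gamma) for _ in range(n_chains)])
        results[(eps, gamma)] = (chains.mean(), chains.std(ddof=1))
    return results

Settings whose between-chain spread is large relative to the mean are the ones this reviewer would flag as uninterpretable.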

Cite this work

@misc{
  title={(HckPrj) Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization},
  author={Jeremias Ferrao, Ilija Lichkovski},
  date={7/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.