Jul 28, 2025
Exploration track: Interpreting LRMs
Suvajit Majumder, Aviral Kaintura
We probe the reasons behind the recent success of reasoning models on complex tasks such as mathematics and coding. Focusing on reflection, backtracking, and other analogous exploratory tokens in the response, we differentiate between the behavior of base and reasoning-finetuned checkpoints of Qwen2.5-Math-1.5B on the MATH-500 dataset. We probe the internal representation differences between these checkpoints, which can help steer model training and inference-time corrections in production.
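To make the proposed probe concrete, here is a minimal sketch of the kind of comparison the abstract describes: extracting hidden states at exploratory tokens from both checkpoints and comparing them. The reasoning-finetuned checkpoint name and the exploratory-token list below are illustrative assumptions, not details taken from the submission.

# Minimal sketch (not the authors' released code): compare last-layer hidden states
# at exploratory tokens ("but", "wait", ...) between a base and a reasoning-finetuned
# checkpoint. The finetuned checkpoint name and the token list are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE  = "Qwen/Qwen2.5-Math-1.5B"                      # base checkpoint named in the abstract
TUNED = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"   # assumed reasoning-finetuned variant

EXPLORATORY = {"but", "wait", "however", "alternatively"}  # illustrative token set

def exploratory_hidden_states(model_name, text, layer=-1):
    """Return hidden vectors at positions whose token matches an exploratory word."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden = out.hidden_states[layer][0]                       # (seq_len, hidden_dim)
    tokens = tok.convert_ids_to_tokens(enc.input_ids[0].tolist())
    idx = [i for i, t in enumerate(tokens) if t.strip("Ġ▁ ").lower() in EXPLORATORY]
    return hidden[idx]                                         # (n_exploratory, hidden_dim)

trace = "So x = 4. But wait, we assumed x > 5, so we backtrack and try x = 6 instead."
h_base  = exploratory_hidden_states(BASE, trace)
h_tuned = exploratory_hidden_states(TUNED, trace)

# Both checkpoints share the same residual-stream width, so the mean exploratory-token
# representations can be compared directly (a crude but simple summary statistic).
cos = torch.nn.functional.cosine_similarity(
    h_base.mean(dim=0), h_tuned.mean(dim=0), dim=0
)
print(f"cos(base, finetuned) over exploratory tokens: {cos.item():.3f}")

A fuller analysis would sweep layers and prompts rather than a single trace, but the same extraction loop applies.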
Andrew Mack
The authors appear to have performed some interesting visualizations of the internal activations of reasoning models. Unfortunately, it seems like they haven't had time to explain the details of their experiments. For example, the top figure of page 5 appears to display some sort of PCA (axes are labelled "PC1" and "PC2") but doesn't explain what the PCA is over or why it is relevant to their project. This seems representative of most of the figures / results presented.
Ari Brill
The topic of understanding reasoning models is important and timely, and it would be very interesting if physics-inspired methods could help shed light on the computations performed by these models. The proposal would have been significantly improved by including details of the proposed RG-based metrics, and by describing the methods used to generate the figures, which are presented largely without explanation.
Jennifer Lin
The authors propose to study why reasoning models outperform conventional LLMs - a major open problem in mechanistic interpretability - by using interpretability toolkits to analyze how both types of models compute their outputs on a small selection of Math500 benchmark prompts. This appears to be fairly standard methodology. For a connection to physics, they suggest coarse-graining in the feature space and defining RG-based computation metrics with which to look for phase transitions. However, what they meant by this was not precisely defined. They attached results from several preliminary experiments. However, I found this paper hard to read because the plots for the experiments were not accompanied by textual explanations. For future work, my strong recommendation to the authors would be to spend more time explaining what was done (e.g. what exactly was measured by each experiment, why they chose to measure it and what the results imply) even if it comes at the cost of having significantly fewer experiments to show.
Nikita Khomich
Detecting phase transitions in LLMs during RL is not new, but it is interesting. However, I couldn't find any code, the GitHub link didn't work, and the graphs are not explained at all. I am not sure I'm convinced by the result.
Esben Kran
The project aims to evaluate internal representation differences between reasoning and base models on exploratory reasoning tasks in math, specifically reasoning-retracing words such as "but" and each model's confidence in them. The team uses information-theoretic tools and renormalization group theory to bring physics understanding to this task. I am not as well-versed as some of the other judges in QFT applied to NNs (though I have read parts of the PBDT work), but it seems that the preliminary investigations are successful and that you can somewhat use QFT to get a preliminary understanding of where significant changes in this processing occur. I'd love to see the finished version of this work, especially with concrete examples of a reasoning trace where you can predict that it will continue reasoning, or something similar. If we get a deeper understanding than just looking at CoT, we may be able to solve key issues with model accountability. I am also very curious to see models with English/Chinese-readable CoT compared with models whose CoTs are inscrutable (some DeepSeek outputs, I believe, use random characters to do processing), so we can then compare what sort of underlying NN dynamics emerge as a result.
Lauren
This reads like a GPT‑drafted memo—broad questions, buzz‑heavy terminology, and unlabeled plots that do not obviously correspond to any described experiment. To become a viable hackathon project it needs a crisp research question, clearly defined variables/observables, and a clear safety link. The abstract and first few sections bring up everything and the kitchen sink, and the steps of the proposed project don’t make sense in terms of ordering, scope, and magnitude. Would you focus on an RG-inspired method (if so, which one)? What is the concrete reason you are focusing on ‘trigger words’? How are you measuring a phase transition? Why is any of this – concretely – important for identifying or preventing unsafe behaviors in LRMs?
Cite this work
@misc{majumder2025interpretinglrms,
  title={(HckPrj) Exploration track: Interpreting LRMs},
  author={Suvajit Majumder and Aviral Kaintura},
  date={2025-07-28},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}