Jul 28, 2025

Exploration track: Interpreting LRMs

Suvajit Majumder, Aviral Kaintura

We probe the reasons behind the recent success of reasoning models on complex tasks such as math and coding. Focusing on reflection, backtracking, and other exploratory tokens in the response, we differentiate between the behavior of the base and reasoning-finetuned checkpoints of Qwen2.5-Math-1.5B on the Math500 dataset. We probe the internal representation differences between these checkpoints, which can help steer model training and enable inference-time corrections in production.
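The probing pipeline is not spelled out on this page; as a rough illustration only, the checkpoint comparison could look like the sketch below, which runs PCA over per-token hidden states. The activation arrays here are synthetic stand-ins — in the real setup they would come from forward passes of the base and finetuned Qwen2.5-Math-1.5B checkpoints at exploratory tokens such as "but" or "wait".

```python
import numpy as np

def pca_project(acts, k=2):
    """Project activation vectors onto their top-k principal components."""
    centered = acts - acts.mean(axis=0)
    # SVD of the centered matrix: rows of vt are principal directions
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    return centered @ vt[:k].T, explained

rng = np.random.default_rng(0)
# Stand-ins for hidden states collected at exploratory tokens;
# shapes mimic (num_tokens, hidden_dim) slices from real forward passes.
base_acts = rng.normal(size=(200, 64))
rl_acts = rng.normal(size=(200, 64)) + 1.5  # hypothetical shifted cluster

both = np.vstack([base_acts, rl_acts])
proj, var_ratio = pca_project(both, k=2)
print(proj.shape)
```

Plotting the two halves of `proj` with different colors would give a figure of the "PC1 vs PC2" kind the reviewers describe; whether the checkpoints actually separate is exactly the empirical question.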

Reviewer's Comments



The authors appear to have performed some interesting visualizations of the internal activations of reasoning models. Unfortunately, it seems like they haven't had time to explain the details of their experiments. For example, the top figure of page 5 appears to display some sort of PCA (axes are labelled "PC1" and "PC2") but doesn't explain what the PCA is over or why it is relevant to their project. This seems representative of most of the figures / results presented.

The topic of understanding reasoning models is important and timely, and it would be very interesting if physics-inspired methods could help shed light on the computations performed by these models. The proposal would have been significantly improved by including details of the proposed RG-based metrics, and by describing the methods used to generate the figures, which are presented largely without explanation.

The authors propose to study why reasoning models outperform conventional LLMs - a major open problem in mechanistic interpretability - by using interpretability toolkits to analyze how both types of models compute their outputs on a small selection of Math500 benchmark prompts. This appears to be fairly standard methodology. For a connection to physics, they suggest coarse-graining in the feature space and defining RG-based computation metrics with which to look for phase transitions. However, what they meant by this was not precisely defined. They attached results from several preliminary experiments. However, I found this paper hard to read because the plots for the experiments were not accompanied by textual explanations. For future work, my strong recommendation to the authors would be to spend more time explaining what was done (e.g. what exactly was measured by each experiment, why they chose to measure it and what the results imply) even if it comes at the cost of having significantly fewer experiments to show.

Detecting phase transitions in LLMs during RL is not new but interesting. However, I couldn't find any code, the GitHub link didn't work, and the graphs are not explained at all. I am not sure I'm convinced by the results.

The project aims to evaluate internal representation differences between reasoning and base models on exploratory reasoning tasks in math, specifically reasoning-retracing words such as "but" and each model's confidence in them. The team uses information-theoretic tools and renormalization group theory to bring physics understanding to this task. I am not as well-versed as some of the other judges on QFT applied to NNs (though I have read parts of the PDLT work), but it seems that the preliminary investigations are successful and that QFT can give a preliminary understanding of where significant changes in this processing occur. I'd love to see the finished version of this work, especially with concrete examples of a reasoning trace where you can predict that the model will continue reasoning, or something similar. If we get a deeper understanding than just reading the CoT, we may be able to solve key issues with model accountability. I am also very curious to see models with English/Chinese-readable CoT compared against inscrutable CoTs (some DeepSeek outputs, I believe, use seemingly random characters to do processing) to see what underlying NN dynamics emerge in each case.
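The "confidence in retracing words" that this comment refers to is not defined on this page. One plausible reading, sketched under that assumption, is the next-token probability mass a model assigns to tokens like "but" at a given position, computed by a softmax over the logits. The logits vector and the token ids below are random stand-ins, not real model outputs or vocabulary indices.

```python
import numpy as np

def token_confidence(logits, token_ids):
    """Softmax probability mass assigned to a set of candidate token ids."""
    z = logits - logits.max()              # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return probs[token_ids].sum()

rng = np.random.default_rng(1)
logits = rng.normal(size=1000)             # stand-in for one step of model logits
exploratory_ids = [17, 42, 256]            # hypothetical ids for "but", "wait", "however"
conf = token_confidence(logits, exploratory_ids)
```

Tracking this quantity per position, separately for the base and finetuned checkpoints, would make the "confidence in retracing words" comparison concrete.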

This reads like a GPT-drafted memo: broad questions, buzz-heavy terminology, and unlabeled plots that do not obviously correspond to any described experiment. To become a viable hackathon project it needs a crisp research question, clearly defined variables/observables, and a clear safety link. The abstract and first few sections bring up everything and the kitchen sink, and the steps of the proposed project don't make sense in terms of ordering, scope, and magnitude. Would you focus on an RG-inspired method (if so, which one)? What is the concrete reason you are focusing on "trigger words"? How are you measuring a phase transition? Why is any of this, concretely, important for identifying or preventing unsafe behaviors in LRMs?

Cite this work

@misc{
  title={(HckPrj) Exploration track: Interpreting LRMs},
  author={Suvajit Majumder and Aviral Kaintura},
  date={7/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects


Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}


Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.


Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.