Jul 28, 2025

Exploration track: Interpreting LRMs

Suvajit Majumder, Aviral Kaintura

We probe the reasons behind the recent success of reasoning models on complex tasks such as math and coding. Focusing on reflection, backtracking, and analogous exploratory tokens in the response, we differentiate between the behavior of the base and reasoning-finetuned checkpoints of Qwen2.5-Math-1.5b on the Math500 dataset. We probe the differences in internal representations between these checkpoints, which can help steer model training and inference-time corrections in production.

Reviewer's Comments

The authors propose to study why reasoning models outperform conventional LLMs - a major open problem in mechanistic interpretability - by using interpretability toolkits to analyze how both types of models compute their outputs on a small selection of Math500 benchmark prompts. This appears to be fairly standard methodology. For a connection to physics, they suggest coarse-graining in the feature space and defining RG-based computation metrics with which to look for phase transitions. However, what they meant by this was not precisely defined. They attached results from several preliminary experiments. However, I found this paper hard to read because the plots for the experiments were not accompanied by textual explanations. For future work, my strong recommendation to the authors would be to spend more time explaining what was done (e.g. what exactly was measured by each experiment, why they chose to measure it and what the results imply) even if it comes at the cost of having significantly fewer experiments to show.

Detecting phase transitions in LLMs during RL is not new, but it is interesting. However, I couldn't find any code, the GitHub link didn't work, and the graphs are not explained at all. I am not sure I'm convinced by the result.

The project aims to evaluate internal representation differences between reasoning and base models for exploratory reasoning tasks in math, specifically reasoning-retracing words such as "but" and each model's confidence in them. The team uses information-theoretic tools and renormalization group theory to bring physics understanding to this task. I am not as well-versed as some of the other judges on QFT applied to NNs (but have read parts of the PBDT work), but it seems like the preliminary investigations are successful and that you can somewhat use QFT to get a preliminary understanding of where significant changes in this processing occur. I'd love to see the finished version of this work, especially with concrete examples of a reasoning trace where you can predict that it will continue reasoning or something similar. If we get a deeper understanding than just looking at CoT, we may be able to solve key issues with model accountability. I am also very curious to see models with English/Chinese readable CoT compared with models with inscrutable CoTs (e.g. some DeepSeek outputs, I believe, use random characters to do processing), so that we can compare what sort of underlying NN dynamics emerge as a result.

This reads like a GPT‑drafted memo—broad questions, buzz‑heavy terminology, and unlabeled plots that do not obviously correspond to any described experiment. To become a viable hackathon project it needs a crisp research question, clearly defined variables/observables, and a clear safety link. The abstract and first few sections bring up everything and the kitchen sink, and the steps of the proposed project don’t make sense in terms of ordering, scope, and magnitude. Would you focus on an RG-inspired method (if so, which one)? What is the concrete reason you are focusing on ‘trigger words’? How are you measuring a phase transition? Why is any of this – concretely – important for identifying or preventing unsafe behaviors in LRMs?

The authors appear to have performed some interesting visualizations of the internal activations of reasoning models. Unfortunately, it seems like they haven't had time to explain the details of their experiments. For example, the top figure of page 5 appears to display some sort of PCA (axes are labelled "PC1" and "PC2") but doesn't explain what the PCA is over or why it is relevant to their project. This seems representative of most of the figures / results presented.

The topic of understanding reasoning models is important and timely, and it would be very interesting if physics-inspired methods could help shed light on the computations performed by these models. The proposal would have been significantly improved by including details of the proposed RG-based metrics, and by describing the methods used to generate the figures, which are presented largely without explanation.

Cite this work

@misc{
  title={(HckPrj) Exploration track: Interpreting LRMs},
  author={Suvajit Majumder, Aviral Kaintura},
  date={7/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.