Jul 28, 2025

Momentum–Point-Perplexity Mechanics in Large Language Models

Lorenzo Tomaz, Thomas Jones

This work analyzes the hidden states of twenty different open-source transformer language models, ranging from small to medium size and covering five major architectures. The key discovery is that these models show signs of "energy conservation" during inference—meaning a certain measure combining changes in hidden states and token unpredictability stays almost constant as the model processes text.

The authors developed a new framework inspired by physics to jointly analyze how hidden states and prediction confidence evolve over time. They propose that transformers' behavior can be understood as following certain mechanical principles, much like how physical systems follow rules like conservation of energy.

Their experiments show that this conserved quantity varies very little between tokens, especially in untrained (random-weight) models, where it is extremely stable. In pre-trained models the average energy is lower, but its relative fluctuations from token to token are larger.

They also introduce a new method, based on this framework, for controlling transformer outputs by "steering" the hidden states. This method achieves good results—producing completions rated as higher in semantic quality, while still maintaining the same kind of energy stability.

Overall, the findings suggest that viewing transformer models through the lens of physical mechanics gives new, principled ways to interpret and control their behavior. It also highlights a key difference: random models behave more like balanced systems, while trained models make quicker, more decisive state changes at the cost of less precise energy conservation.

Reviewer's Comments


It is interesting to define velocity as the change in hidden state across sequence position. One confounder is that a (potentially) large fraction of the difference can be explained by the token itself, especially at early layers. This work trains a per-token SAE, where the relevant part is training a per-token bias term to explain the residual stream (given the current token, of course). It'd be good to see whether the difference between trained & untrained is just the embedding. This could be checked by training this per-token bias & subtracting it from h_t at every step.
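The check suggested above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it uses the per-token mean of the hidden states as the bias (the least-squares optimum for a pure per-token bias), and the function names `per_token_bias`, `subtract_bias`, and `velocities` are hypothetical. Hidden states are assumed to be plain lists of floats, one per sequence position.

```python
from collections import defaultdict


def per_token_bias(token_ids, hidden_states):
    """Mean hidden state per token id (stand-in for a trained per-token bias)."""
    sums, counts = {}, defaultdict(int)
    for tok, h in zip(token_ids, hidden_states):
        if tok not in sums:
            sums[tok] = [0.0] * len(h)
        sums[tok] = [s + x for s, x in zip(sums[tok], h)]
        counts[tok] += 1
    return {tok: [s / counts[tok] for s in vec] for tok, vec in sums.items()}


def subtract_bias(token_ids, hidden_states, bias):
    """Remove the token-explained part of each hidden state."""
    return [[x - b for x, b in zip(h, bias[tok])]
            for tok, h in zip(token_ids, hidden_states)]


def velocities(hidden_states):
    """Differences between hidden states at adjacent sequence positions."""
    return [[b - a for a, b in zip(h0, h1)]
            for h0, h1 in zip(hidden_states, hidden_states[1:])]
```

Comparing `velocities(hidden_states)` with `velocities(subtract_bias(...))` for trained and untrained models would show how much of the measured velocity is carried by token identity alone.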

Although looking at your K/V definitions, it seems your log-kinetic energy is a measure of velocity (change in hidden state) & your potential energy is a measure of perplexity. Obviously the perplexity will be different between trained & untrained, however, you state:

"

• On trained models, Kt ≈ 7, and Vt typically 0.2 to 2.5

• On untrained models, Vt may reach 6–8, with Kt relatively unchanged"

Meaning that the change in velocity is roughly the same for both. If it's the same, then why is it included in a ratio to differentiate them? The only difference is the perplexity, so I'm unsure what results that rely on both of these metrics tell us about the difference between these models.

Why is SmolLM2-360M an outlier for the random weight models?

Unsure about the protocol for testing minimality of effect of Jacobian steering. For the stated protocol, I'd appreciate an example of the original output & the steered output, so I can understand what "higher semantic quality" of the top-10 tokens might mean.

But what do we actually want? I assume it's to "preserve semantic coherence" while steering in a direction, like steering towards Spanish words, while maintaining meaning w/o devolving into repeating the same words. There is a paper that benchmarks steering by KL-divergence with proposed completions (e.g. finishing a sentence in Spanish, or the original English equivalent), as well as other steering benchmark papers that might work. (Sorry, I cannot find that paper atm.)
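As a rough illustration of the KL term such a benchmark would need, the sketch below computes KL(P || Q) between two next-token distributions given as logits. How the benchmark mentioned above actually uses the divergence is not specified here; the helper names `log_softmax` and `kl_divergence` and the example logits are assumptions for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(logits_p, logits_q):
    """KL(P || Q) in nats for two next-token distributions given as logits."""
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical logits over a 3-token vocabulary, before and after steering.
unsteered = [2.0, 1.0, 0.1]
steered = [1.5, 1.4, 0.1]
print(kl_divergence(unsteered, steered))
```

A small divergence on tokens unrelated to the steering target, together with a large shift on the targeted tokens, is one way to quantify "minimal" steering.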

This is potentially very good work. However, the attached code (kudos for including a repo) has a fundamental bug that makes the conservation results incorrect as tested. I don't know if these results will survive fixing bugs - but the idea is interesting.

The bug is that in the paper, the energy is defined as the product of a kinetic and potential term, whereas in the experiment (line 275 of their momentum_perplexity_phase_space_experiment.py) a sum is used: total_energy_point = kinetic_energy + point_potential

(where the LHS is the "full" kinetic energy and the RHS is the log of the probability). I think this is responsible for the unreasonably large CV scores (given that entropy of text has >1 *additive* standard deviation, it is not reasonable that the multiplicative energy would have the exact conservation properties implied by the CV). I'd be very interested in a corrected calculation here, which should be clearly baselined wrt the CV of the PPL and the logit gradient by themselves.
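The baselining requested here can be illustrated with synthetic numbers. The sketch below borrows the ranges quoted earlier on this page (K_t near 7, V_t between 0.2 and 2.5) purely as plain illustrative values; the nearly constant kinetic term, the uniform distribution for the potential term, and the independence of the two are all assumptions, not measurements from the paper.

```python
import random
import statistics


def cv(values):
    """Coefficient of variation: population standard deviation over the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


random.seed(0)
n = 10_000
# Assumed: kinetic term nearly constant around 7 (illustrative only).
kinetic = [7.0 + random.gauss(0.0, 0.1) for _ in range(n)]
# Assumed: potential term uniform over the quoted 0.2 to 2.5 range.
potential = [random.uniform(0.2, 2.5) for _ in range(n)]

additive = [k + v for k, v in zip(kinetic, potential)]
multiplicative = [k * v for k, v in zip(kinetic, potential)]

for name, series in [("kinetic", kinetic), ("potential", potential),
                     ("sum", additive), ("product", multiplicative)]:
    print(f"CV({name}) = {cv(series):.3f}")
```

Under these assumptions the sum has a much smaller CV than the potential term alone, simply because the large, nearly constant kinetic term dominates the mean, while the product inherits roughly the CV of the potential term. This is why the CV of each term by itself is a necessary baseline.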

The paper claims to find a certain approximate conservation law in trained model token-to-token behavior. The paper also looks at random-weight models, where I strongly suspect the effects are spurious law-of-large-numbers style effects that don't depend on the specific energy function chosen; I'll ignore this for the present review.

For trained models, the claimed energy conservation implies language models trade off uncertainty of text prediction (roughly correlated with next-token entropy) vs. speed of token-to-token movement in the residual stream. When a model is unsure about the next token (it is claimed), its final-layer residual stream moves less than when it is confident of the completion. Moreover, the paper claims that the terms (probability assigned to the output token) and (movement in the residual stream) are roughly proportional in a single text generating instance, with the proportionality constant playing the role of a conserved energy-like parameter. This is very surprising, not least since (by construction) transformers are highly parallel and so one would not expect the difference between adjacent tokens to be a very meaningful measurement (especially since the last layer is to a large extent aligned with the embedding of the output token, which one would expect to be pretty stochastic).

Given these observations, the CV scores given are unreasonably large for the multiplicative energy, and are almost certainly a result of the bug. Indeed, entropy of text has >1 *additive* standard deviation, so it is not reasonable that the multiplicative energy would have the conservation properties implied by the (additive) CV; perhaps the claim is different, but it would have to be fixed. I'd be very interested in a corrected calculation here, noting that it should be clearly baselined wrt the CV of the PPL and the logit gradient by themselves. For example, I would like to see log graphs of the logit gradients and PPL of a few representative text samples.

One could imagine that there is some relationship between last-layer-embeddings and logit entropy which could contribute to some directional effect analogous to the paper's findings, but a stronger relationship that links token dynamics to prediction entropy would be genuinely surprising and interesting - it's interesting that their steering method improves quality. Again, this might be confounded in some way, but I'm looking forward to seeing the corrected results.

Very exciting, really interesting result, clear physics analogy and new ideas which seem to work. I think the conservation is really interesting and generally one of the best papers.

The project is ambitious and cool, with obvious physics and AI safety relevance. If conserved energy really does measure the trade-off between perplexity (in a sense, model uncertainty) and kinetic energy (in terms of the ‘velocity’ of the hidden state), these results can be turned into actionable control methods, but the idea needs a comb through and some serious stress-testing.

The write-up is overall well done, but there are some important edits that should be made. Some examples: ‘time’ is never defined (I assume this is the position in the context window, but it could also be through the layers, or after every training step). If it is the context position, this picture is somewhat muddied by the parallel processing that transformers carry out by design (though maybe there is the potential to treat this statistically instead of as a literal trajectory of the inference process). Sometimes the language (a Lagrangian which contains log terms, or a ‘log Lagrangian’) was confused. This led to some issues, including a bug in the code that treated the resultant energy as additive rather than multiplicative. While this doesn’t necessarily make the project worse, it does make the results harder to judge. I look forward to the follow-up.
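The additive-versus-multiplicative confusion can be stated in one line. Writing E_t for the energy (a symbol assumed here for illustration) and K_t, V_t for the kinetic and potential terms, a product law is additive only in log space:

```latex
E_t = K_t \, V_t
\quad\Longleftrightarrow\quad
\log E_t = \log K_t + \log V_t .
```

A sum of two terms therefore corresponds to the multiplicative energy only if both addends are the logarithms of the respective factors. Adding a non-log kinetic term to a log-probability term, as one of the comments above describes the code as doing, matches neither form.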

The appendices were also very repetitive in content, adding to an already pretty hefty problem of length. I would have liked to have seen one of these directions carefully fleshed out within the body of the paper, and left the rest for follow-up work.

Cite this work

@misc{
  title={(HckPrj) Momentum–Point-Perplexity Mechanics in Large Language Models},
  author={Lorenzo Tomaz, Thomas Jones},
  date={7/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.


Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.


Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.