Jul 28, 2025

Momentum–Point-Perplexity Mechanics in Large Language Models

Lorenzo Tomaz, Thomas Jones

This work analyzes the hidden states of twenty different open-source transformer language models, ranging from small to medium size and covering five major architectures. The key discovery is that these models show signs of "energy conservation" during inference—meaning a certain measure combining changes in hidden states and token unpredictability stays almost constant as the model processes text.

The authors developed a new framework inspired by physics to jointly analyze how hidden states and prediction confidence evolve over time. They propose that transformers' behavior can be understood as following certain mechanical principles, much like how physical systems follow rules like conservation of energy.

Their experiments show that this conserved quantity varies very little between tokens, especially in untrained (random-weight) models, where it is extremely stable. In pre-trained models, training lowers the average energy, but relative fluctuations from token to token are larger.

They also introduce a new method, based on this framework, for controlling transformer outputs by "steering" the hidden states. This method achieves good results—producing completions rated as higher in semantic quality, while still maintaining the same kind of energy stability.

Overall, the findings suggest that viewing transformer models through the lens of physical mechanics gives new, principled ways to interpret and control their behavior. It also highlights a key difference: random models behave more like balanced systems, while trained models make quicker, more decisive state changes at the cost of less precise energy conservation.

Reviewer's Comments

Lauren

The project is ambitious and cool, with obvious physics and AI safety relevance. If conserved energy really does measure the trade-off between perplexity (in a sense, model uncertainty) and kinetic energy (in terms of the ‘velocity’ of the hidden state), these results can be turned into actionable control methods, but the idea needs a comb through and some serious stress-testing.

The write-up is overall well done, but there are some important edits that should be made. Some examples: ‘time’ is never defined (I assume this is the position in the context window, but it could also be through the layers, or after every training step). If it is the context position, this picture is somewhat muddied by the parallel processing that transformers carry out by design (though maybe there is the potential to treat this statistically instead of as a literal trajectory of the inference process). Sometimes the terminology was confused (a Lagrangian which contains log terms vs. a ‘log Lagrangian’). This led to some issues, including a bug in the code that treated the resultant energy as additive rather than multiplicative. While this doesn’t necessarily make the project worse, it does make the results harder to judge. I look forward to the follow-up.

The appendices were also very repetitive in content, adding to an already pretty hefty problem of length. I would have liked to have seen one of these directions carefully fleshed out within the body of the paper, and left the rest for follow-up work.

Logan Riggs Smith

Interesting to define velocity as change in hidden state across sequence position. One confounder is that a (potentially) large fraction of the difference can be explained by the token, especially at early layers. This work trains a per-token SAE, where the relevant part is training a per-token bias term to explain the residual stream (given the current token of course). It'd be good to see if the difference between trained & untrained isn't just the embedding. This could be checked by training this per-token bias & subtracting that from h_t at every step.
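The check suggested above can be sketched roughly as follows. This is a minimal illustration, not the reviewer's or the authors' procedure: instead of training a per-token bias, it uses the per-token mean of the hidden states (the closed-form least-squares bias), and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def per_token_mean(token_ids, hidden_states):
    """Mean hidden state per token id (stand-in for a trained per-token bias)."""
    sums = {}
    counts = defaultdict(int)
    for tok, h in zip(token_ids, hidden_states):
        if tok not in sums:
            sums[tok] = [0.0] * len(h)
        sums[tok] = [s + x for s, x in zip(sums[tok], h)]
        counts[tok] += 1
    return {tok: [s / counts[tok] for s in vec] for tok, vec in sums.items()}


def subtract_token_bias(token_ids, hidden_states, bias):
    """Remove the token-explained part of each hidden state h_t."""
    return [
        [x - b for x, b in zip(h, bias[tok])]
        for tok, h in zip(token_ids, hidden_states)
    ]


def velocities(hidden_states):
    """Euclidean norm of the change in hidden state between adjacent positions."""
    return [math.dist(a, b) for a, b in zip(hidden_states, hidden_states[1:])]


# Toy data (hypothetical): each hidden state is mostly determined by its token.
token_ids = [0, 1, 0, 1]
hidden = [[1.0, 0.1], [0.0, 1.0], [1.0, -0.1], [0.0, 1.0]]

bias = per_token_mean(token_ids, hidden)
residual = subtract_token_bias(token_ids, hidden, bias)

raw_velocity = velocities(hidden)
residual_velocity = velocities(residual)
print("raw:", raw_velocity)
print("after subtracting per-token bias:", residual_velocity)
```

In practice the bias would need to be estimated on a separate, much larger corpus; with few occurrences per token the residual shrinks toward zero trivially.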

Although looking at your K/V definitions, it seems your log-kinetic energy is a measure of velocity (change in hidden state) & your potential energy is a measure of perplexity. Obviously the perplexity will be different between trained & untrained, however, you state:

• On trained models, K_t ≈ 7, and V_t typically 0.2 to 2.5

• On untrained models, V_t may reach 6–8, with K_t relatively unchanged

Meaning that the change in velocity is ~the same between the two. If it's the same, then why is it included in a ratio to differentiate them? The only difference is the perplexity, so I'm unsure what results which rely on both of these metrics tell us about the difference between these models.

Why is SmolLM2-360M an outlier for the random weight models?

Unsure on protocol for testing minimality of effect of Jacobian steering. For the stated protocol, I'd appreciate an example of the original output & the steered output, so I can understand what "higher semantic quality" of the top-10 tokens might mean.

But what do we actually want? I assume it's to "preserve semantic coherence" while steering in a direction, like steering towards Spanish words, while maintaining meaning w/o devolving into repeating the same words. There is a paper that benchmarks steering by KL-divergence with proposed completions (e.g. finishing a sentence in Spanish, or the original English equivalent) as well as other steering benchmark papers that might work. (Sorry I cannot find that paper atm).
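As a rough illustration of a KL-based comparison of this kind: the sketch below is not the protocol of the benchmark paper mentioned above (which is not identified here), only a generic KL divergence between next-token distributions. The three-token vocabulary and all logit values are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical next-token logits over a 3-token vocabulary.
target = softmax([0.1, 3.0, 0.2])      # distribution implied by the proposed completion
unsteered = softmax([3.0, 0.1, 0.2])   # model output without steering
steered = softmax([0.5, 2.5, 0.2])     # model output with steering

kl_unsteered = kl_divergence(target, unsteered)
kl_steered = kl_divergence(target, steered)
print(f"KL(target || unsteered) = {kl_unsteered:.3f}")
print(f"KL(target || steered)   = {kl_steered:.3f}")
```

A lower divergence from the target after steering would indicate movement toward the proposed completion. It says nothing by itself about fluency or repetition, which would need a separate check.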

Dmitry Vaintrob

This is potentially very good work. However, the attached code (kudos for including a repo) has a fundamental bug that makes the conservation results incorrect as tested. I don't know if these results will survive fixing bugs - but the idea is interesting.

The bug is that in the paper, the energy is defined as the product of a kinetic and potential term, whereas in the experiment (line 275 of their momentum_perplexity_phase_space_experiment.py) a sum is used: total_energy_point = kinetic_energy + point_potential

(where the LHS is the "full" kinetic energy and the RHS is the log of the probability). I think this is responsible for the unreasonably large CV scores (given that entropy of text has >1 *additive* standard deviation, it is not reasonable that the multiplicative energy would have the exact conservation properties implied by the CV). I'd be very interested in a corrected calculation here, which should be clearly baselined wrt the CV of the PPL and the logit gradient by themselves.
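A minimal sketch of the kind of baselined comparison requested here, assuming CV means the standard deviation divided by the absolute mean. The data are synthetic stand-ins rather than values from the paper or its repository, with magnitudes loosely echoing those quoted in the reviews, and the function names are hypothetical.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    return statistics.pstdev(values) / abs(statistics.fmean(values))


def energy_report(kinetic, potential):
    """CV of the additive and multiplicative energies, plus each term alone."""
    additive = [k + v for k, v in zip(kinetic, potential)]
    multiplicative = [k * v for k, v in zip(kinetic, potential)]
    return {
        "cv_kinetic": coefficient_of_variation(kinetic),
        "cv_potential": coefficient_of_variation(potential),
        "cv_additive": coefficient_of_variation(additive),
        "cv_multiplicative": coefficient_of_variation(multiplicative),
    }


# Synthetic per-token values (NOT from the paper): a large, nearly constant
# kinetic term and a smaller, noisy potential term.
rng = random.Random(0)
kinetic = [7.0 + rng.gauss(0.0, 0.1) for _ in range(1000)]
potential = [rng.uniform(0.2, 2.5) for _ in range(1000)]

report = energy_report(kinetic, potential)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

On data shaped like this, the additive CV comes out far smaller than the multiplicative one simply because the large, stable term dominates the sum. This is why reporting the CV of each term by itself is a useful baseline.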

The paper claims to find a certain approximate conservation law in trained model token-to-token behavior. The paper also looks at random-weight models, where I strongly suspect the effects are spurious law-of-large-numbers style effects that don't depend on the specific energy function chosen; I'll ignore this for the present review.

For trained models, the claimed energy conservation implies language models trade off uncertainty of text prediction (roughly correlated with next-token entropy) vs. speed of token-to-token movement in the residual stream. When a model is unsure about the next token (it is claimed), its final-layer residual stream moves less than when it is confident of the completion. Moreover, the paper claims that the terms (probability assigned to the output token) and (movement in the residual stream) are roughly proportional in a single text generating instance, with the proportionality constant playing the role of a conserved energy-like parameter. This is very surprising, not least since (by construction) transformers are highly parallel and so one would not expect the difference between adjacent tokens to be a very meaningful measurement (especially since the last layer is to a large extent aligned with the embedding of the output token, which one would expect to be pretty stochastic).

Given these observations, the CV scores given are unreasonably large for the multiplicative energy, and are almost certainly a result of the bug. Indeed, entropy of text has >1 *additive* standard deviation, so it is not reasonable that the multiplicative energy would have the conservation properties implied by the (additive) CV; perhaps the claim is different, but it would have to be fixed. I'd be very interested in a corrected calculation here, noting that it should be clearly baselined wrt the CV of the PPL and the logit gradient by themselves. For example, I would like to see log graphs of the logit gradients and PPL of a few representative text samples.

One could imagine that there is some relationship between last-layer-embeddings and logit entropy which could contribute to some directional effect analogous to the paper's findings, but a stronger relationship that links token dynamics to prediction entropy would be genuinely surprising and interesting - it's interesting that their steering method improves quality. Again, this might be confounded in some way, but I'm looking forward to seeing the corrected results.

Nikita Khomich

Very exciting, really interesting result, clear physics analogy and new ideas which seem to work. I think the conservation is really interesting and generally one of the best papers.

Cite this work

@misc{
  title={(HckPrj) Momentum–Point-Perplexity Mechanics in Large Language Models},
  author={Lorenzo Tomaz, Thomas Jones},
  date={7/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jul 28, 2025

Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization

In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC (indicating a phase transition) and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at github.com/ilijalichkovski/apart-physics.

Jul 28, 2025

AI agentic system epidemiology

As AI systems scale into decentralized, multi-agent deployments, emergent vulnerabilities challenge our ability to evaluate and manage systemic risks.

In this work, we adapt classical epidemiological modeling (specifically SEIR compartment models) to model adversarial behavior propagation in AI agents.

By solving systems of ODEs describing the systems with physics-informed neural networks (PINNs), we analyze stable and unstable equilibria, bifurcation points, and the effectiveness of interventions.

We estimate parameters from real-world data (e.g., adversarial success rates, detection latency, patching delays) and simulate attack propagation scenarios across 8 sectors (enterprise, retail, trading, development, customer service, academia, medical, and critical infrastructure AI tools).

Our results demonstrate how agent population dynamics interact with architectural and policy design interventions to stabilize the system.

This framework bridges concepts from dynamical systems and cybersecurity to offer a proactive, quantitative toolbox for AI safety.

We argue that epidemic-style monitoring and tools grounded in interpretable, physics-aligned dynamics can serve as early warning systems for cascading AI agentic failures.

Jul 27, 2025

Constrained Belief Updates and Geometric Structures in Transformer Representations for the RRXOR Process

In this work, we seek to further understand the computational structure formed by transformers trained to predict the next token of a data set. In particular, we extend the work of Piotrowski et al. (2025) by analyzing if and how these transformers implement constrained Bayesian updating, focusing on data generated by a Random-Random-XOR (RRXOR) process. Unlike the MESS3 processes studied originally in Piotrowski et al. (2025), the RRXOR process leads to slightly more nuanced theoretical and practical models. Nevertheless, we verify some of the main claims of Piotrowski et al. (2025). Namely, we find attention patterns that decay exponentially with distance. We further analyze the behavior of OV vectors.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
