Jul 27, 2025

Thermodynamics-inspired OOD Detection

Salmaan Barday, Gary Louw

A different approach to thermodynamics-inspired out-of-distribution detection.


Reviewer's Comments

Andrew Mack

The authors present a method for out-of-distribution detection using pre-trained image classifiers. The way they've quantified performance is clear, and figure 1 suggests their method beats an energy-based baseline, which is exciting. The presentation could be clearer - I'm not sure how they define "blockwise energy" E_\ell(x) (I think it's based off projecting to logits for each layer but am not very certain). These results add to a growing body of work showing that intermediate-layer activations may be more useful for probing tasks than final layer activations (see e.g. https://arxiv.org/pdf/2412.09563v1). Confirming/refuting such theories seems like a nice goal for a hackathon.

Ari Brill

The project investigates energy-based out-of-distribution detection by studying an extension of that method, in which the energy is computed at every residual block. The approach makes sense and the results are presented clearly. However, no mention is made of AI safety, and the project would have benefited from articulating a clear connection between the research performed and AI safety challenges.
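For context, here is a minimal sketch of the standard energy score used in energy-based OOD detection, E(x) = -T * logsumexp(f(x) / T), where f(x) are the classifier logits. The function name `energy_score` and the example logits are illustrative assumptions, not taken from the submission.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * logsumexp(logits / T) over the last axis.

    Lower (more negative) energy corresponds to a confident,
    in-distribution-looking prediction.
    """
    z = np.asarray(logits, dtype=np.float64) / temperature
    # Subtract the per-row maximum before exponentiating for numerical stability.
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

# Illustrative logits (assumed values): one peaked, one flat.
confident = energy_score([10.0, 0.0, 0.0])
flat = energy_score([0.0, 0.0, 0.0])
```

In this toy case the peaked logits yield a lower energy than the flat ones, which is the property a threshold on the score relies on.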

Esben Kran

Very interesting work! Seeing comparisons in OOD performance with other methods would indeed be interesting, as you note. The question with new tools is always *whether they work*. In terms of detecting OOD samples better, this is really valuable, and having a strong framework for doing this seems important for nearly all relevant real-world fields of NN use, such as medicine and law, providing a heuristic for rejecting answers or at least providing human-readable warnings on the outputs.

However, I think there's another extension of all this work that would be really interesting: Whether we can evaluate the fuzzy boundaries of what is OOD to the models and what isn't. And whether we can see examples of algorithmic generalization (e.g. learning addition instead of memorizing results of single-digit addition tasks) through this method. I'm not certain how feasible it is, but having a proper evaluation of generalized OOD performance that transcends single tasks (i.e. through a benchmark-based method) seems incredibly important and like a feasible extension to the concepts and tools introduced in this work.

Max Hennick

While the method is interesting and a reasonable extension, the experimental setup is unclear and limits reproducibility. For example, it isn’t specified how intermediate block activations are mapped to class logits per block (auxiliary heads? a shared classifier? pooling + linear map?), yet the energy definition requires logits.
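To make the ambiguity concrete, the sketch below shows one of the options listed above: global average pooling followed by a per-block linear map to class logits, then an energy per block. This is purely an assumption for illustration; the submission does not state which mapping it uses, and the names `blockwise_energies`, `heads`, and the mean aggregation are hypothetical.

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log-sum-exp over the last axis."""
    m = z.max(axis=-1, keepdims=True)
    return m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))

def blockwise_energies(block_activations, heads, temperature=1.0):
    """Return an (N, L) array of per-block energies.

    block_activations: list of L arrays shaped (N, C_l, H_l, W_l).
    heads: list of L (weight, bias) pairs, weight shaped (C_l, K).
    Both the pooling and the linear heads are ASSUMED here, not the authors' method.
    """
    energies = []
    for act, (weight, bias) in zip(block_activations, heads):
        pooled = act.mean(axis=(2, 3))      # global average pooling -> (N, C_l)
        logits = pooled @ weight + bias     # hypothetical linear head -> (N, K)
        energies.append(-temperature * logsumexp(logits / temperature))
    return np.stack(energies, axis=1)

# Random stand-in activations and heads for two blocks, 10 classes.
rng = np.random.default_rng(0)
acts = [rng.normal(size=(4, 8, 6, 6)), rng.normal(size=(4, 16, 3, 3))]
heads = [(rng.normal(size=(8, 10)), np.zeros(10)),
         (rng.normal(size=(16, 10)), np.zeros(10))]
per_block = blockwise_energies(acts, heads)
score = per_block.mean(axis=1)  # one possible aggregation across blocks (assumed)
```

Auxiliary heads trained per block, or a single classifier shared across blocks, would change the logits and therefore the energies, which is why specifying the choice matters for reproducibility.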

Lauren

Block‑wise energy pooling is a tidy proof‑of‑concept, but its benefit over the OOD detector from Liu et al. was unmotivated and unclear. A minimal result could have been to compare this with the reference, and to contextualize this a bit more within the literature (as it was the only reference used). I'm not convinced that this approach has any real implications for AI safety.

The authors could also have explained their experiments and plots more carefully.

Cite this work

@misc{
  title={(HckPrj) Thermodynamics-inspired OOD Detection},
  author={Salmaan Barday, Gary Louw},
  date={7/27/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects


Jul 28, 2025

Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization

In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC, which indicate a phase transition, and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at github.com/ilijalichkovski/apart-physics.

Jul 28, 2025

AI agentic system epidemiology

As AI systems scale into decentralized, multi-agent deployments, emergent vulnerabilities challenge our ability to evaluate and manage systemic risks.

In this work, we adapt classical epidemiological modeling (specifically SEIR compartment models) to model adversarial behavior propagation in AI agents.

By solving systems of ODEs describing the systems with physics-informed neural networks (PINNs), we analyze stable and unstable equilibria, bifurcation points, and the effectiveness of interventions.

We estimate parameters from real-world data (e.g., adversarial success rates, detection latency, patching delays) and simulate attack propagation scenarios across 8 sectors (enterprise, retail, trading, development, customer service, academia, medical, and critical infrastructure AI tools).

Our results demonstrate how agent population dynamics interact with architectural and policy design interventions to stabilize the system.

This framework bridges concepts from dynamical systems and cybersecurity to offer a proactive, quantitative toolbox on AI safety.

We argue that epidemic-style monitoring and tools grounded in interpretable, physics-aligned dynamics can serve as early warning systems for cascading AI agentic failures.

Jul 28, 2025

Momentum–Point-Perplexity Mechanics in Large Language Models

This work analyzes the hidden states of twenty different open-source transformer language models, ranging from small to medium size and covering five major architectures. The key discovery is that these models show signs of "energy conservation" during inference—meaning a certain measure combining changes in hidden states and token unpredictability stays almost constant as the model processes text.

The authors developed a new framework inspired by physics to jointly analyze how hidden states and prediction confidence evolve over time. They propose that transformers' behavior can be understood as following certain mechanical principles, much like how physical systems follow rules like conservation of energy.

Their experiments show that this conserved quantity varies very little between tokens, especially in untrained (random-weight) models, where it's extremely stable. In pre-trained models, the average energy drops more due to training, but there are larger relative fluctuations from token to token.

They also introduce a new method, based on this framework, for controlling transformer outputs by "steering" the hidden states. This method achieves good results—producing completions rated as higher in semantic quality, while still maintaining the same kind of energy stability.

Overall, the findings suggest that viewing transformer models through the lens of physical mechanics gives new, principled ways to interpret and control their behavior. It also highlights a key difference: random models behave more like balanced systems, while trained models make quicker, more decisive state changes at the cost of less precise energy conservation.


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
