Although there is a lot of good physics-for-AI literature out there, we take a predominantly ‘problem-first’ approach, so as not to restrict solutions to specific physics fields, methods, or tools. We’re excited to reframe old perspectives and find new ones!
To guide you in getting started, we’ve mapped out the following problem space for AI safety, as we see it. We welcome ideas outside of this list (which is by no means comprehensive), as long as they speak to both an AI safety problem and a physics solution. Some ideas are more thought out than others, and many of the potential projects listed here are speculative, so we encourage participants to reach out to mentors and collaborators early and often to hone their ideas. We also stress that the boundaries between these areas are artificial and that many projects will span multiple sections.
Problem Area 1: Bridging the Theory/Practice Gap for AI Interpretability
Mentors: Daniel Kunin, Andrew Mack, Garrett Merz, Eric Michaud, Paul Riechers, Adam Scherlis, Logan Smith, Dmitry Vaintrob
The crux: Neural networks are complex systems with broad and flexible representational power. Mechanistic interpretability researchers are actively pursuing methods to reverse-engineer a trained network by decomposing it into human-interpretable parts (for example, methods based on sparse dictionary learning). To date, these methods have met with mixed success, with a central issue being their lack of solid theoretical and conceptual foundations. Can physicists bridge the theory/practice gap? How can theoretical analyses of inductive biases, feature learning, and optimization in neural networks help us create better interpretability tools?
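To make the sparse-dictionary-learning framing concrete, here is a minimal sketch of a sparse autoencoder of the kind used in this line of work. The class name, hyperparameters, and loss are illustrative assumptions; real implementations differ in details such as tied weights, bias handling, normalization, and the exact sparsity penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: reconstruct model activations through an
    overcomplete ReLU dictionary, encouraging sparse feature activations."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(features), features

def sae_loss(x, x_hat, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty pushing features toward sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coeff * features.abs().mean()
```

Trained on activations harvested from a model of interest, the decoder columns become candidate ‘features’ to interpret; the theoretical questions below ask when and why such decompositions are the right ones.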
Problem Description:
It’s important to distinguish between ‘interpretability tools’ and ‘interpretability’ in general, and to take a broader view (beyond ‘mechanistic’) in defining the latter. In particular, we wish to include more theoretical directions that may be seen as ‘computational interpretability’ (i.e., tracking belief states rather than neural circuits), ‘developmental interpretability’ (i.e., monitoring degeneracies in the learning landscape), or directions built on more speculative techniques like renormalization. While mechanistic interpretability focuses on identifying which internal components work together to implement specific behaviors, these directions can help identify the kinds of explanations that could robustly describe and predict those behaviors. Specifically, they could help us reach consensus on:
What counts as a mechanism or feature
How to distinguish real structure from phenomenological artifacts
A ‘model natural’ framework for interpreting models in terms of their own internal abstractions, not just human categories.
We stress that there are some excellent mechanistic interpretability toy model settings and tools (e.g., TransformerLens), and encourage their use to test hypotheses.
References Related to the Problem Space:
TAIS RFP: Research Areas | Open Philanthropy
Applications of white-box techniques
Finding feature representations
Toy models of interpretability
[2501.16496] Open Problems in Mechanistic Interpretability
Open problems in mechanistic interpretability methods and foundations
Connection to Physics:
Physicists are historically good at closing the theory-practice gap, finding structure in messy data, and discarding irrelevant degrees of freedom at different levels of abstraction. They have experience tracking degrees of freedom, model parameters, and observables in controlled experiments. In neural networks, this is essential for homing in on precise causal relationships among network internals.
Candidate Research Directions:
Predictive State Extraction from Neural Representations: Use techniques from computational mechanics (e.g., ε-machines) to identify causal states that relate to messy, real-world data distributions. Evaluate whether these states compress activation space and retain task-relevant predictive power.
Coarse-graining of features via renormalization group (RG) techniques: Define, cluster, and probe features, quantifying relevance and abstraction across scales. Visualize a “flow” of features, akin to an RG trajectory.
Operationalize the natural abstraction hypothesis to develop empirical tests for the compression or predictability of learned abstractions.
Develop more realistic formal models of superposition that apply to real-world networks
Signal Geometry via Optimal Transport Maps: Using Rigollet-style methods, compute optimal transport maps between layer activations in small transformers. Track how representations compress, spread, or drift across layers, and how this relates to learnability in transformers (a minimal sketch follows this list).
Feature Phase Transitions via SLT: Apply metrics from singular learning theory (e.g., real log canonical thresholds or learning coefficient estimates) to detect phase transitions in feature formation during grokking.
Build on this paper: https://arxiv.org/abs/2504.18274
Information bottleneck (IB) and renormalization:
Gaussian IB also operates as a “coarse-graining”: for a given point on the IB curve, the optimal representation is computable from the top K eigenvalues: https://arxiv.org/abs/2107.13700
Maximally disentangled representations minimize total correlation. When does IB/RG drive this to happen naturally?
Embedding-space dynamics: How does an individual embedding vector or cluster of embedding vectors evolve over training? Study symmetries, attractors, flows, Noether invariants, “BERT-ological” geometry, etc.
Symbolic distillation: Symbolic regression techniques can be used to interpret messages in message passing neural networks (MPNNs): https://arxiv.org/abs/2006.11287. Can we apply this to other architectures or probe with hypernetworks?
Latent space geometry: Study the geometric properties of the latent features learned by sparse autoencoders (SAEs). What insight can we gain about the structure of natural data, and the strengths and/or limitations of existing mechanistic interpretability methods?
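For the optimal-transport direction above, here is a minimal NumPy sketch using a sliced approximation rather than exact OT maps. It assumes you have already collected per-layer activation matrices `acts[l]` of shape `(n_tokens, d_model)` from a small transformer; the function name and the harvesting step are hypothetical.

```python
import numpy as np

def sliced_w2(X, Y, n_proj=128, seed=0):
    """Monte-Carlo sliced 2-Wasserstein distance between two point clouds
    X, Y of shape (n, d): project onto random unit directions and match
    the sorted 1-D projections."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    xp = np.sort(X @ dirs.T, axis=0)   # shape (n, n_proj)
    yp = np.sort(Y @ dirs.T, axis=0)
    return np.sqrt(np.mean((xp - yp) ** 2))

# Hypothetical usage: `acts` is a list of per-layer activation matrices,
# each of shape (n_tokens, d_model). Consecutive-layer distances give a
# crude "flow" profile across depth.
# flow = [sliced_w2(acts[l], acts[l + 1]) for l in range(len(acts) - 1)]
```

Replacing the sliced approximation with exact transport maps (e.g., via entropic regularization) would recover per-token displacement fields rather than just distances.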
Problem Area 2: Learning, Inductive Biases, and Generalization
Mentors: Jesse Hoogland, Daniel Kunin, Garrett Merz, Adam Scherlis, Alok Singh, Dmitry Vaintrob
The crux: Large models often generalize far beyond the regimes where traditional statistical learning theory applies. They memorize noisy data, then suddenly grok patterns. They learn from context at inference time without parameter updates. These behaviors suggest the presence of strong inductive biases—implicit assumptions that shape learning dynamics and generalization—but we don’t yet know how to characterize or control them. Physicists, with their fluency in emergent behavior, dualities, and phase transitions, could offer the theoretical tools to explain why models generalize the way they do—and how we might shape that generalization.
Problem Description:
We want to understand how architecture, initialization, data distribution, and training procedures give rise to inductive biases—and in turn, how these biases control what is learned, when generalization occurs, and why models may fail.
This includes:
Generalization phenomena in overparameterized regimes (e.g., grokking, double descent; a toy double-descent sketch appears below)
In-context learning, where models appear to learn from inference-time data but the mechanism is poorly understood
Inductive biases from architecture, such as attention, equivariance, or modular structure
Memorization vs. generalization tradeoffs and phase transitions
Feature vs. kernel regimes, and how to interpolate between them
Our goal is to move beyond empirical curve-fitting and develop physically grounded, predictive models of the learning process—especially those that clarify when generalization will succeed, fail, or change character.
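As a toy illustration of the overparameterized phenomena above, the following NumPy sketch exhibits double descent in minimum-norm regression on random ReLU features. The dataset, widths, and noise level are arbitrary choices, and the exact shape of the curve depends on them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20                              # training samples, input dimension
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)   # noisy linear teacher
X_test = rng.normal(size=(2000, d))
y_test = X_test @ w_true

def test_mse(width):
    """Min-norm least squares on `width` random ReLU features."""
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    F, F_test = np.maximum(X @ W, 0), np.maximum(X_test @ W, 0)
    beta = np.linalg.lstsq(F, y, rcond=None)[0]   # minimum-norm solution
    return np.mean((F_test @ beta - y_test) ** 2)

for width in [10, 50, 90, 100, 110, 150, 500, 2000]:
    print(width, round(test_mse(width), 3))
# Test error typically spikes near the interpolation threshold (width ≈ n)
# and then decreases again in the over-parameterized regime.
```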
References Related to the Problem Space:
TAIS RFP: Research Areas | Open Philanthropy
Theoretical study of inductive biases
In-Context Learning (ICL) Is Black-Box (particularly Mesa-Optimization, effects of data distribution and shift, design choices)
More Transparent Architectures
Connection to Physics:
In a sense, physics is the science of inductive bias. It tells us that the universe favors certain configurations over others—symmetric over asymmetric, low-energy over high-energy, local over nonlocal—and builds theories to explain why. Physicists are trained to ask: What regularities are baked into a system? What constraints guide its evolution? Which degrees of freedom matter, and which can be ignored? This makes them well-suited to reasoning about how generalization arises from structural and statistical biases hidden in the architecture, data, and training process.
Candidate Research Directions:
Unbalanced Initialization and Rapid Representation Learning: Extend experiments from the paper “Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning”
New models of grokking and phase transitions, using toy models/model organisms (a minimal modular-arithmetic setup is sketched after this list)
Interpolating between different theoretical or empirical regimes, including over- vs. under-parameterization, memorization vs. generalization, and feature vs. kernel learning
Architectural Inductive Bias Diagnostics: perform controlled experiments to compare physics-informed architectures (KANs, PINNs, equivariant models) vs. standard MLPs/transformers.
Signal Propagation Geometry in Transformers: Use optimal transport maps to study how token or attention head activations evolve across layers. Quantify expansion/contraction and link this to generalization failures or bottlenecks.
Generalization in Bayesian statistical learning: Better understand the “Bayes quartet” (the different forms of generalization in Bayesian statistical learning) as studied in singular learning theory, and how these relate to the empirical generalization errors we encounter in practice.
Free energy in singular learning theory: Understand the lower-order terms in the free energy formula in SLT. What role does the constant term play?
Transferability of Features: Test the “Features are Fate” hypothesis: evaluate how early-formed features predict transfer learning performance. Use compressed regression or classification tasks to isolate effects.
Concept Emergence under Physical Priors (e.g., sparsity, locality): Train models with architectural bottlenecks (e.g., concept bottlenecks, layer constraints) and study which abstractions emerge. Evaluate compressiveness and reusability of intermediate concepts across tasks.
Learning on highly-structured data: Mathematical data offers a very good proving-ground for interpretability research (https://arxiv.org/abs/2308.15594).
Certain subclasses of problems are consistently easier than others across many of these and related works. Can we formalize this idea, maybe with something like Kolmogorov complexity?
Emergence of known symmetries in token embedding space geometry (as in grokking)
When is a model provably using incorrect heuristics vs a “true” symmetry?
Repetition helps with generalization on mathematical data (https://arxiv.org/pdf/2410.07041)
Non-standard analysis for analyzing neural networks: Use the toolkit of non-standard analysis to simplify derivations analyzing scaling regimes and learning coefficients in neural networks.
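A common starting point for the grokking and phase-transition directions above is modular arithmetic with weight decay. Below is a minimal PyTorch sketch of that setup; the modulus, train fraction, architecture, and optimizer settings are arbitrary, and whether or when delayed generalization appears depends strongly on them, so treat this as a scaffold rather than a guaranteed reproduction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
p, frac_train = 97, 0.3                      # modulus, fraction of pairs used for training
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(frac_train * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# One-hot encode (a, b) pairs as a 2p-dimensional input.
X = torch.cat([F.one_hot(pairs[:, 0], p), F.one_hot(pairs[:, 1], p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(20_000):
    opt.zero_grad()
    loss = F.cross_entropy(model(X[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            test_acc = (model(X[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(step, float(loss), float(test_acc))
```

Tracking train versus test accuracy over long runs, and varying the weight decay and train fraction, is the usual way to map out where (and whether) a delayed-generalization transition occurs in this setup.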
Problem Area 3: Mathematical Models of Data Structure
Mentors: Martin Biehl, Daniel Kunin, Eric Michaud, Garrett Merz, Paul Riechers
The crux: Intelligent systems that generalize well often do so by internalizing a model of the world—a latent structure that supports abstraction, prediction, and decision-making. Misalignment arises when this internal model – gleaned from the data – encodes goals, beliefs, or causal relationships incorrectly.
Problem Description: Can we develop useful mathematical models to describe the data structure internally represented by generally intelligent AI systems? How can methods from physics help us better understand abstraction, generalization, world modeling, and transfer learning in AI systems? We seek to:
Identify when and how natural abstractions emerge from data
Model how different inductive biases interact with the underlying data structure
Understand how ontological shifts and representation instability relate to misalignment or deception
Connection to Physics: Physicists tackle a similar challenge every day: nature’s fundamental structure is unknown, but we can gradually uncover it by collecting and interpreting experimental data, guided by theory. To do so, physicists develop mathematical models that capture the essence of phenomena.
Candidate Research Directions:
Build toy models of emergent data structure using sparse graphs, hidden manifolds, hierarchical composition, percolation clusters, or other structured data generating processes
Investigate the belief state geometry learned by transformers trained on sequential data generated by hidden Markov models and other processes (a toy forward-algorithm sketch follows this list)
Study toy models of tasks requiring discrete, hierarchical skills
Develop formal models of latent variables present in natural data
Investigate methods of dimensionality reduction in high dimensional data spaces
Apply singular learning theory to understand degeneracies in learning from structured data
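For the belief-state direction above, a useful baseline is to compute the ground-truth belief states of the data-generating process with the forward algorithm; these points on the probability simplex are what trained sequence models have been probed against. The sketch below uses a made-up 3-state HMM over a binary alphabet, not parameters from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state HMM over a binary alphabet (made-up parameters).
T = np.array([[0.8, 0.1, 0.1],    # T[i, j] = P(next state j | current state i)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
E = np.array([[0.9, 0.1],         # E[i, x] = P(symbol x | state i)
              [0.5, 0.5],
              [0.1, 0.9]])

def sample(n_steps):
    s, symbols = rng.integers(3), []
    for _ in range(n_steps):
        symbols.append(rng.choice(2, p=E[s]))   # emit from the current state,
        s = rng.choice(3, p=T[s])               # then transition
    return symbols

def belief_states(symbols):
    """Forward-algorithm beliefs P(hidden state | observations so far)."""
    b, traj = np.ones(3) / 3, []
    for x in symbols:
        b = b * E[:, x]           # condition on the emitted symbol
        b /= b.sum()
        traj.append(b.copy())
        b = b @ T                 # propagate to the next hidden state
    return np.array(traj)

beliefs = belief_states(sample(500))   # (500, 3) points on the 2-simplex
```

Comparing these belief trajectories against a linear probe of the residual stream of a transformer trained on the same sequences is one way to test the belief-state-geometry picture.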
Problem Area 4: Quantitative Bounds on AI Behavior
Mentors: Martin Biehl, Paul Riechers, Alok Singh
The crux: A central goal of technical AI safety is to rigorously evaluate alignment – even in regimes where misbehavior might be rare, hard to anticipate or understand, or overlooked during training. But today’s systems offer little in the way of provable guarantees. What is it possible to know about AI systems, their behavior, and various asymptotic limits?
Problem Description:
To what extent can we predict quantitative bounds on the behavior of powerful AI systems? The end goal is not just safer training, but safety guarantees that hold even when capabilities exceed current testing regimes. Topics of interest include:
Bounding model outputs in the presence of distributional shift or adversarial perturbations
Quantifying rare but catastrophic behaviors (e.g., heavy-tailed output distributions; a tail-index sketch follows this list)
Analyzing stability in multi-agent systems
Formalizing unpredictability via dynamical systems analysis or chaos theory
Connecting statistical uncertainty in model behavior to logical or agent-based reasoning frameworks
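For the heavy-tailed item above, one simple classical tool is the Hill estimator of the tail index. The sketch below only sanity-checks it on synthetic Pareto data; applying it to some extreme statistic of model behavior (and deciding which statistic) is the actual research question.

```python
import numpy as np

def hill_tail_index(x, k):
    """Hill estimator of the tail index alpha from the k largest order
    statistics of |x|. Smaller alpha means heavier tails, i.e. extreme
    outcomes are relatively more probable."""
    x = np.sort(np.abs(np.asarray(x, dtype=float)))
    return 1.0 / np.mean(np.log(x[-k:] / x[-k - 1]))

# Sanity check on synthetic Pareto data with true tail index alpha = 2.
rng = np.random.default_rng(0)
sample = rng.pareto(2.0, size=100_000) + 1.0
print(hill_tail_index(sample, k=1_000))   # should be close to 2
```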
Connection to Physics:
Physicists routinely quantify the unknown, bounding system behavior under noise, drift, or instability. To support this, they have a good understanding of theoretical limits and of how far the physical system of interest sits from them.
References Related to the Problem Space:
TAIS RFP: Research Areas | Open Philanthropy
White-box estimation of rare misbehaviors (anomaly detection?)
Conceptual clarity about risks from powerful AI (for papers track, and also uncertainty estimation/limits of knowledge)
Candidate Research Directions:
Adversarial Example Detection Using Statistical Physics
Use energy-based models to identify out-of-distribution inputs (a minimal scoring sketch follows this list)
Human-AI Feedback Loop Modeling
Apply control theory to analyze stability in human-AI interactions. (Does this only apply in online settings?)
Safe Reinforcement Learning Using Robust Control Barrier Functions
Estimating provable unpredictability and its relationship with chaos theory
Exploring connections between physical limits and agent foundations
Robust delegation in embedded agency, e.g. instrumental convergence, Vingean reflection, Goodhart’s law
Understanding different regimes of stability in AI training
Modeling rare events or separating signal vs noise
Probing the impact of scale separation in guiding ‘safe’ asymptotic regions
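A minimal version of the energy-based detection idea above scores inputs with the free energy of a classifier’s logits, in the spirit of standard energy-based out-of-distribution detection; `model`, `x_batch`, and `threshold` are placeholders, and the threshold would be calibrated on held-out in-distribution data.

```python
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy score E(x) = -T * logsumexp(logits / T). In-distribution inputs
    tend to receive lower energy than out-of-distribution ones."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Hypothetical usage with any trained classifier `model`:
# scores = energy_score(model(x_batch))
# flagged = scores > threshold   # threshold calibrated on in-distribution data
```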
Problem Area 5: Scaling Laws, Phase Transitions, and Emergence
Mentors: Jesse Hoogland, Eric Michaud, Garrett Merz, Alok Singh
The crux: AI performance improves as a power law (in model size, dataset size, or compute) across a wide variety of modalities, architectures, and tasks. Yet emergent capabilities and sudden training behavior that depend on scale seem to defy simple extrapolation. Is there an end to the ‘bitter lesson’?
Problem Description:
While scaling laws are widely observed, they are still poorly understood. We want to model the limits of the scaling hypothesis, understand transitions in representation and capability, and develop toy models where emergence can be rigorously analyzed. Key questions include:
Are observed power laws truly universal, or artifacts of particular model/data choices?
How can apparently smooth scaling laws be reconciled with the emergence of discrete capabilities (e.g., grokking)?
Connection to Physics:
In many ways, physics is the science of scale. Critical phenomena, universality, renormalization, and scale separation all help to explain, characterize, and predict behavior in physical systems. Can similar models of data, training, and inference provide intuition and grounding for empirical scaling laws?
References Related to the Problem Space:
[2404.09932] Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Effects of Scale on Capabilities are not well-characterized
Qualitative Understanding of Reasoning Capabilities is Lacking
Candidate Research Directions:
Scaling laws from the data manifold dimension (a toy nearest-neighbor illustration follows this list)
Develop models of neural scaling that build on the insight of Sharma & Kaplan (2022) and Bahri et al. (2024)
Scaling laws from discrete power-law-distributed skills
Transferability phase diagrams in the low-data limit
Controlled experiments and tracking more nuanced scaling behaviors
Models for the transience of capabilities
Non-standard analysis as a formal language of scale
Use hyperfinite numbers to perform mean field theory calculations more simply
Singular learning theory as a theoretical model for emergent capabilities
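To make the manifold-dimension picture above concrete, here is a crude NumPy toy: nearest-neighbor regression on data with intrinsic dimension d, whose test loss should scale roughly as N^(-2/d) (smoother interpolators such as neural networks are argued to reach exponents closer to 4/d). The dimensions, target function, and sample sizes are arbitrary choices, and the fitted exponent is only approximate at these scales.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 4, 32                          # intrinsic and ambient dimension
A = rng.normal(size=(d, D))           # fixed linear embedding of the manifold

def make_data(n):
    z = rng.uniform(size=(n, d))      # latent coordinates on the manifold
    return z @ A, np.sin(2 * np.pi * z).sum(axis=1)   # inputs, smooth target

X_test, y_test = make_data(2_000)

def one_nn_mse(n_train):
    X, y = make_data(n_train)
    d2 = ((X_test ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :]
          - 2.0 * X_test @ X.T)       # squared distances, test x train
    return np.mean((y[d2.argmin(axis=1)] - y_test) ** 2)

Ns = np.array([200, 400, 800, 1_600, 3_200, 6_400])
losses = np.array([one_nn_mse(n) for n in Ns])
alpha = -np.polyfit(np.log(Ns), np.log(losses), 1)[0]
print(f"fitted exponent {alpha:.2f}  vs  2/d = {2 / d:.2f}")
```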
Potential Failure Modes
Over-idealization: Building elegant toy models that fail to scale or capture key complexities of real NNs.
Metaphorical use of physics: Using “renormalization,” “entropy,” or “causality” as metaphors rather than formal mechanisms, without being clear about this.
Conflating representation with explanation: Modeling a system's behavior mathematically without understanding the actual computational mechanisms at play.
Confusing the ‘fundamental’ with the operational: Behaviors may subtly depend on architecture and hyperparameter choice.
Over-symmetrization: Injecting physically natural symmetry where the data doesn’t support it.
Unrealistic scales: Analyzing tractable but unrealistic scaling regimes or assuming equilibrium states that real neural networks may never achieve during training.
Neglecting engineering realities: Ignoring that successful ML systems often work because of "ugly" engineering tricks, regularization hacks, and empirical tuning that have no clean physics interpretation.