Problem Areas in Physics and AI Safety

We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.

Ari Brill, Researcher

Lauren Greenspan, Researcher


Although there is already a lot of good physics-for-AI literature out there, we take a predominantly ‘problem-first’ approach, so as not to restrict solutions to specific physics fields, methods, or tools. We’re excited to reframe existing perspectives and to find new ones!

To guide you in getting started, we’ve developed the following problem space for AI safety as we see it. We welcome ideas outside of this list (which is by no means comprehensive), as long as they speak to both an AI safety problem and a physics solution. Some ideas are more thought out than others, and many of the potential projects here are speculative, so we encourage participants to reach out to mentors and collaborators early and often to hone their ideas. We also stress that the boundaries between these areas are artificial, and many projects will span multiple sections.

Problem Area 1: Bridging the Theory/Practice Gap for AI Interpretability 

Mentors: Daniel Kunin, Andrew Mack, Garrett Merz, Eric Michaud, Paul Riechers, Adam Scherlis, Logan Smith, Dmitry Vaintrob

The crux: Neural networks are complex systems with broad and flexible representational power. Mechanistic interpretability researchers are actively pursuing methods to reverse-engineer a trained network by decomposing it into human-interpretable parts (for example, methods based on sparse dictionary learning). To date, these methods have met with mixed success, with a central issue being their lack of solid theoretical and conceptual foundations. Can physicists bridge the theory/practice gap? How can theoretical analyses of inductive biases, feature learning, and optimization in neural networks help us create better interpretability tools?

Problem Description: 

It’s important to distinguish between ‘interpretability tools’ and ‘interpretability’ in general, and take a broader view (beyond ‘mechanistic’) in defining the latter. In particular, we wish to include more theoretical directions that may be seen as ‘computational interpretability’ (i.e., tracking belief states rather than neural circuits), ‘developmental interpretability’ (i.e., monitoring degeneracies in the learning landscape), or those built on more speculative techniques like renormalization. While mechanistic interpretability focuses on identifying which internal components work together to implement specific behaviors, these directions can help identify the kinds of explanations that could robustly describe and predict these behaviors. In particular, they could help us reach consensus on: 

  • What counts as a mechanism or feature 

  • How to distinguish real structure from phenomenological artifacts

  • A ‘model natural’ framework for interpreting models in terms of their own internal abstractions, not just human categories.

We stress that there are some excellent mechanistic interpretability toy model settings and tools (e.g., TransformerLens), and encourage their use to test hypotheses. 
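To make the sparse-dictionary-learning direction mentioned in the crux concrete, here is a minimal sketch of a sparse autoencoder trained on network activations. It assumes the activations were collected elsewhere (for example with TransformerLens’s run_with_cache) and substitutes random tensors as a stand-in; the dictionary size, sparsity penalty, and training loop are illustrative choices, not a tuned recipe.

```python
# Minimal sparse-autoencoder sketch for decomposing activations into
# candidate "features" (illustrative assumptions throughout, not a tuned recipe).
import torch
import torch.nn as nn

d_model, d_dict, l1_coeff = 512, 4096, 1e-3   # illustrative sizes and penalty

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))            # nonnegative feature activations
        return self.dec(f), f

# Stand-in for real activations; in practice these would be residual-stream
# vectors cached from a model (e.g., via TransformerLens run_with_cache).
acts = torch.randn(8192, d_model)

sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(1000):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The interesting questions for this problem area begin where this sketch ends: which of the learned dictionary directions correspond to real structure in the model, and by what criterion we would know.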

References Related to the Problem Space: 

 TAIS RFP: Research Areas | Open Philanthropy

  1. Applications of white-box techniques 

  2. Finding feature representations 

  3. Toy models of interpretability

[2501.16496] Open Problems in Mechanistic Interpretability

  1. Open problems in mechanistic interpretability methods and foundations 

Connection to Physics: 

Physicists are historically good at closing the theory-practice gap, finding structure in messy data, and discarding irrelevant degrees of freedom at different levels of abstraction. They have experience tracking degrees of freedom, model parameters, and observables in controlled experiments. In neural networks, this discipline is essential for homing in on precise causal relationships among a network’s internal components and between those components and its behavior. 

Candidate Research Directions: 

Problem Area 2: Learning, Inductive Biases, and Generalization 

Mentors: Jesse Hoogland, Daniel Kunin, Garrett Merz, Adam Scherlis, Alok Singh, Dmitry Vaintrob

The crux: Large models often generalize far beyond the regimes where traditional statistical learning theory applies. They memorize noisy data, then suddenly grok patterns. They learn from context at inference time without parameter updates. These behaviors suggest the presence of strong inductive biases—implicit assumptions that shape learning dynamics and generalization—but we don’t yet know how to characterize or control them. Physicists, with their fluency in emergent behavior, dualities, and phase transitions, could offer the theoretical tools to explain why models generalize the way they do—and how we might shape that generalization.

Problem Description: 

We want to understand how architecture, initialization, data distribution, and training procedures give rise to inductive biases—and in turn, how these biases control what is learned, when generalization occurs, and why models may fail.

This includes:

  • Generalization phenomena in overparameterized regimes (e.g., grokking, double descent; see the toy sketch after this problem description)

  • In-context learning, where models appear to learn from inference-time data but the mechanism is poorly understood

  • Inductive biases from architecture, such as attention, equivariance, or modular structure

  • Memorization vs. generalization tradeoffs and phase transitions

  • Feature vs. kernel regimes, and how to interpolate between them

Our goal is to move beyond empirical curve-fitting and develop physically grounded, predictive models of the learning process—especially those that clarify when generalization will succeed, fail, or change character.
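As a concrete entry point to the grokking bullet above, here is a minimal toy setup in the usual style: a small MLP trained on modular addition with a limited train split and strong weight decay. Everything here (modulus, train fraction, architecture, optimizer settings) is an illustrative assumption; when, or whether, delayed generalization appears is exactly the kind of sensitivity worth studying.

```python
# Toy grokking setup (a sketch): modular addition learned by a small MLP with
# a limited train split and strong weight decay. All hyperparameters are
# illustrative; whether test accuracy jumps long after the train loss is near
# zero depends on the train fraction, weight decay, and learning rate.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
p = 97                                                     # modulus
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))                            # small train fraction
train_idx, test_idx = perm[:n_train], perm[n_train:]

def encode(xy):
    # Represent (a, b) as a concatenation of two one-hot vectors.
    return torch.cat([F.one_hot(xy[:, 0], p), F.one_hot(xy[:, 1], p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

x_train, y_train = encode(pairs[train_idx]), labels[train_idx]
x_test, y_test = encode(pairs[test_idx]), labels[test_idx]

for epoch in range(20_000):
    loss = F.cross_entropy(model(x_train), y_train)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if epoch % 1_000 == 0:
        with torch.no_grad():
            test_acc = (model(x_test).argmax(-1) == y_test).float().mean().item()
        print(f"epoch {epoch:>6}: train loss {loss.item():.4f}, test acc {test_acc:.3f}")
```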

References Related to the Problem Space: 

 TAIS RFP: Research Areas | Open Philanthropy

  1. Theoretical study of inductive biases 

  2. In-Context Learning (ICL) Is Black-Box (particularly Mesa-Optimization, effects of data distribution and shift, design choices)

 TAIS RFP: Research Areas | Open Philanthropy

  1. More Transparent Architectures

Connection to Physics: 

In a sense, physics is the science of inductive bias. It tells us that the universe favors certain configurations over others—symmetric over asymmetric, low-energy over high-energy, local over nonlocal—and builds theories to explain why. Physicists are trained to ask: What regularities are baked into a system? What constraints guide its evolution? Which degrees of freedom matter, and which can be ignored? This makes them well-suited to reasoning about how generalization arises from structural and statistical biases hidden in the architecture, data, and training process. 

Candidate Research Directions: 

  • Unbalanced Initialization and Rapid Representation Learning: Extend experiments from the paper “Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning”

  • New models of grokking and phase transitions, using toy models/model organisms

  • Interpolating between different theoretical or empirical regimes, including over- vs. under-parameterization, memorization vs. generalization, and feature vs. kernel learning (see the sketch after this list) 

  • Architectural Inductive Bias Diagnostics: perform controlled experiments to compare physics-informed architectures (KANs, PINNs, equivariant models) vs. standard MLPs/transformers. 

  • Signal Propagation Geometry in Transformers: Use optimal transport maps to study how token or attention head activations evolve across layers. Quantify expansion/contraction and link this to generalization failures or bottlenecks.

  • Generalization in Bayesian statistical learning: Better understand the “Bayes quartet” (the different forms of generalization in Bayesian statistical learning) as studied in singular learning theory, and how these relate to the empirical generalization errors we encounter in practice.

  • Free energy in singular learning theory: Understand lower-order terms in the free energy formula in SLT. What role does the constant term play?

  • Transferability of Features: Test the “Features are Fate” hypothesis: evaluate how early-formed features predict transfer learning performance. Use compressed regression or classification tasks to isolate effects.

  • Concept Emergence under Physical Priors (e.g., sparsity, locality): Train models with architectural bottlenecks (e.g., concept bottlenecks, layer constraints) and study which abstractions emerge. Evaluate compressiveness and reusability of intermediate concepts across tasks.

  • Learning on highly-structured data: Mathematical data offers a very good proving-ground for interpretability research (https://arxiv.org/abs/2308.15594).

    • Certain subclasses of problems are consistently easier than others across many of these and related works. Can we formalize this idea, perhaps with something like Kolmogorov complexity?

    • Emergence of known symmetries in token embedding space geometry (as in grokking)

      • When is a model provably using incorrect heuristics vs a “true” symmetry?

    • Repetition helps with generalization on mathematical data (https://arxiv.org/pdf/2410.07041)

  • Non-standard analysis for analyzing neural networks: Use the toolkit of non-standard analysis to simplify derivations analyzing scaling regimes and learning coefficients in neural networks.
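For the feature-vs-kernel bullet above, one simple experimental handle is the output-scaling trick from the lazy-training literature (in the spirit of Chizat and Bach): training alpha * (f(x; w) - f(x; w0)) with an appropriately rescaled learning rate interpolates between a kernel-like regime (large alpha, weights barely move) and a feature-learning regime (small alpha, large weight movement). The task, width, and learning rate below are illustrative assumptions, not a canonical benchmark.

```python
# Sketch of the lazy-vs-rich (kernel-vs-feature) interpolation via an output
# scaling factor alpha. Large alpha: weights barely move (kernel-like);
# small alpha: substantial feature learning, which may also need more steps.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 10)
y = torch.sin(3.0 * x[:, :1])                        # simple nonlinear target

def run(alpha, steps=2000, base_lr=1e-3):
    torch.manual_seed(1)                             # same init for every alpha
    net = nn.Sequential(nn.Linear(10, 512), nn.Tanh(), nn.Linear(512, 1))
    h0 = net(x).detach()                             # output at initialization
    w0 = torch.cat([p.flatten() for p in net.parameters()]).detach().clone()
    opt = torch.optim.SGD(net.parameters(), lr=base_lr / alpha ** 2)
    for _ in range(steps):
        pred = alpha * (net(x) - h0)                 # centered, rescaled output
        loss = ((pred - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    w = torch.cat([p.flatten() for p in net.parameters()]).detach()
    return loss.item(), (w - w0).norm().item()       # fit quality, weight movement

for alpha in [0.1, 1.0, 10.0, 100.0]:
    loss, drift = run(alpha)
    print(f"alpha={alpha:>6}: loss={loss:.4f}, parameter drift={drift:.3f}")
```

Tracking how the learned representations (not just the weight drift) change across alpha is where this connects back to interpretability and to the candidate directions above.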

Problem Area 3: Mathematical Models of Data Structure

Mentors: Martin Biehl, Daniel Kunin, Eric Michaud, Garrett Merz, Paul Riechers 

The crux: Intelligent systems that generalize well often do so by internalizing a model of the world—a latent structure that supports abstraction, prediction, and decision-making. Misalignment arises when this internal model – gleaned from the data – encodes goals, beliefs, or causal relationships incorrectly. 

Problem Description: Can we develop useful mathematical models to describe the data structure internally represented by generally intelligent AI systems?  How can methods from physics help us better understand abstraction, generalization, world modeling, and transfer learning in AI systems? We seek to:

  • Identify when and how natural abstractions emerge from data

  • Model how different inductive biases interact with the underlying data structure (see the toy data-generating sketch after this list)

  • Understand how ontological shifts and representation instability relate to misalignment or deception
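One concrete toy family of structured data, offered as an illustrative assumption rather than a canonical choice, is sparse latent features linearly projected into a lower-dimensional observation space, in the spirit of superposition-style toy models. Because the ground-truth latents are known, questions about emerging abstractions and representation shifts become directly measurable.

```python
# One toy family of structured data (a sketch): sparse latent features linearly
# projected ("superposed") into a lower-dimensional observation space,
# x = W f + noise. All dimensions and the noise scale are illustrative.
import torch

torch.manual_seed(0)
n_features, d_obs, n_samples = 64, 16, 10_000
sparsity = 0.05                                     # probability a feature is active

W = torch.nn.functional.normalize(torch.randn(d_obs, n_features), dim=0)
active = (torch.rand(n_samples, n_features) < sparsity).float()
magnitude = torch.rand(n_samples, n_features)       # strength of a feature when active
f = active * magnitude                              # sparse ground-truth latents
x = f @ W.T + 0.01 * torch.randn(n_samples, d_obs)  # observed data

# A downstream experiment could train a model on x and measure how well its
# internal representations recover the known latents f as sparsity, feature
# correlations, or hierarchical grouping of features are varied.
print(x.shape, f.shape, f.ne(0).float().mean().item())
```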

Connection to Physics: Physicists tackle a similar challenge every day: nature’s fundamental structure is unknown, but we can gradually uncover it by collecting and interpreting experimental data, guided by theory. To do so, physicists develop mathematical models that capture the essence of phenomena. 

Candidate Research Directions:

Problem Area 4: Quantitative Bounds on AI Behavior 

Mentors: Martin Biehl, Paul Riechers, Alok Singh

The crux: A central goal of technical AI safety is to rigorously evaluate alignment – even in regimes where misbehavior might be rare, hard to anticipate or understand, or overlooked during training. But today’s systems offer little in the way of provable guarantees. What is it possible to know about AI systems, their behavior, and various asymptotic limits?

Problem Description: 

To what extent can we place quantitative bounds on the behavior of powerful AI systems? The end goal is not just safer training, but safety guarantees that hold even when capabilities exceed current testing regimes. Relevant directions include:

  • Bounding model outputs in the presence of distributional shift or adversarial perturbations

  • Quantifying rare but catastrophic behaviors (e.g., via heavy-tailed distributions; see the sketch after this list)

  • Analyzing stability in multi-agent systems

  • Formalizing unpredictability via dynamical systems analysis or chaos theory

  • Connecting statistical uncertainty in model behavior to logical or agent-based reasoning frameworks
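As one small illustration of the heavy-tails bullet above, the sketch below estimates a tail exponent from the largest observations of a scalar “badness” score using a Hill estimator. The Pareto-distributed scores are a synthetic stand-in, and the cutoff k is a genuine methodological choice on any non-synthetic data.

```python
# Sketch: estimating the tail exponent of a heavy-tailed "badness" score
# (e.g., a per-episode loss or harm proxy) with a Hill estimator. The Pareto
# sample is a synthetic stand-in for real measurements.
import numpy as np

rng = np.random.default_rng(0)
true_alpha = 1.5
scores = rng.pareto(true_alpha, size=100_000) + 1.0   # Pareto(alpha) with minimum 1

def hill_tail_exponent(x, k):
    """Hill estimate of the tail exponent alpha from the k largest observations."""
    x = np.sort(x)
    gamma = np.mean(np.log(x[-k:]) - np.log(x[-k - 1]))  # mean log-excess over X_(n-k)
    return 1.0 / gamma

for k in [100, 1_000, 10_000]:
    print(f"k={k:>6}: estimated alpha ~ {hill_tail_exponent(scores, k):.2f} "
          f"(true {true_alpha})")
```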

Connection to Physics:

Physicists routinely quantify the unknown, bounding system behavior under noise, drift, or instability. To do so, they rely on a good understanding of theoretical limits and of how far a given physical system sits from those limits. 

References Related to the Problem Space: 

 TAIS RFP: Research Areas | Open Philanthropy

  1. White-box estimation of rare misbehaviors (anomaly detection?) 

  2. Conceptual clarity about risks from powerful AI (for the papers track, and also uncertainty estimation / limits of knowledge)

Candidate Research Directions: 

Problem Area 5:  Scaling Laws, Phase Transitions, and Emergence

Mentors: Jesse Hoogland, Eric Michaud, Garrett Merz, Alok Singh

The crux: AI performance improves as a power law (in model size, dataset size, or compute) across a wide variety of modalities, architectures, and tasks. Yet emergent capabilities and sudden, scale-dependent changes in training behavior seem to defy simple extrapolation. Is there an end to the ‘bitter lesson’? 

Problem Description:

While scaling laws are widely observed, they are still poorly understood.  We want to model the limits of the scaling hypothesis, understand transitions in representation and capability, and develop toy models where emergence can be rigorously analyzed.

  • Are observed power laws truly universal, or artifacts of particular model/data choices?

  • How can apparently smooth scaling laws be reconciled with the emergence of discrete capabilities (e.g., grokking)?
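The sketch below is a minimal numerical illustration of one candidate reconciliation discussed in the literature: if per-token error falls smoothly as a power law in model size, a strict exact-match metric over multi-token answers can still look like an abrupt capability jump. The exponent and constants are invented for illustration, not fit to any real scaling data.

```python
# Numerical sketch: smooth per-token power-law improvement vs. an apparently
# abrupt jump in a strict exact-match metric over multi-token answers.
# The exponent and constants are invented for illustration only.
import numpy as np

N = np.logspace(6, 11, 50)                      # model sizes (parameters)
per_token_error = np.clip(2.0 * N ** -0.15, 0.0, 1.0)   # smooth power-law error

for L in [1, 5, 20]:                            # answer lengths in tokens
    exact_match = (1.0 - per_token_error) ** L  # all L tokens must be correct
    above = N[exact_match > 0.5]
    if len(above):
        print(f"answer length {L:>2}: exact-match passes 50% near N ~ {above[0]:.1e}")
    else:
        print(f"answer length {L:>2}: exact-match never passes 50% in this range")
```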

Connection to Physics: 

In many ways, physics is the science of scale. Critical phenomena, universality, renormalization, and scale separation all help to explain, characterize, and predict behavior in physical systems. Can similar models of data, training, and inference provide intuition and grounding for empirical scaling laws?

References Related to the Problem Space: 

[2404.09932] Foundational Challenges in Assuring Alignment and Safety of Large Language Models

  1. Effects of Scale on Capabilities are not well-characterized 

  2. Qualitative Understanding of Reasoning Capabilities is Lacking 

Candidate Research Directions: 

Potential Failure Modes

  • Over-idealization: Building elegant toy models that fail to scale or capture key complexities of real NNs.

  • Metaphorical use of physics: Using “renormalization,” “entropy,” or “causality” as metaphors rather than formal mechanisms, without being clear about this.

  • Conflating representation with explanation: Modeling a system's behavior mathematically without understanding the actual computational mechanisms at play.

  • Confusing the ‘fundamental’ with the operational: Behaviors may subtly depend on architecture and hyperparameter choice. 

  • Over-symmetrization: Injecting physically natural symmetry where the data doesn’t support it.

  • Unrealistic scales: Analyzing tractable but unrealistic scaling regimes or assuming equilibrium states that real neural networks may never achieve during training.

  • Neglecting engineering realities: Ignoring that successful ML systems often work because of "ugly" engineering tricks, regularization hacks, and empirical tuning that have no clean physics interpretation.