Dual-RG Alignment: Probing Safety Phase Transitions in Language Models

Anindita Maiti, Pranjal Ralegankar, Kuntal Pal

Modern language models (LMs) often appear well-behaved until a tiny change, such as a clever jailbreak prompt, a lightweight fine-tune, or a higher sampling temperature, suddenly causes them to ignore safety policies. In physics, such abrupt flips are classic signs of a phase transition, often described in terms of renormalization group (RG) flow. We introduce a virtual RG step, known as coarse-graining, on attention heads to predict the proximity of such tipping points for state-of-the-art LM architectures, without requiring massive training hours. The coarse-graining method systematically limits the amount of input detail the attention heads are exposed to, thereby gradually degrading safety metrics and inference abilities. We show that these coarse-graining effects, which are independent of, and sometimes compete with, training-induced coarse-graining or pruning of model weights, induce smooth degradation of the safety metrics of Gemma-2b-it, an indicator of deep-rooted AI alignment.
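For concreteness, here is a minimal sketch (not the authors' code) of the sliding-window restriction that plays the role of the coarse-graining knob b, written in plain PyTorch with illustrative tensor shapes:

```python
import torch

def banded_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend only to key positions j
    with i - window < j <= i (causal, last-`window`-tokens)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

def windowed_attention(q, k, v, window):
    """Scaled dot-product attention restricted to a sliding window.
    q, k, v: (batch, heads, seq, dim)."""
    dim = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / dim ** 0.5
    mask = banded_causal_mask(q.shape[-2], window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Shrinking the window b progressively removes long-range context.
q = k = v = torch.randn(1, 8, 16, 64)
for b in (16, 8, 4, 2):
    out = windowed_attention(q, k, v, window=b)
    print(b, out.shape)
```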

Reviewer's Comments


Ari Brill

The idea that safety-relevant properties of an AI system could be described as a macroscopic order parameter and potentially exhibit phase transitions is interesting and may be a fruitful research direction. For future work, it may be rewarding to carefully consider the interpretation of the measured variables. In particular, it seemed unclear in this project whether the attention window b was supposed to represent a system coarse-graining scale, or an external variable (analogous to temperature or external magnetic field in physics).

Jennifer Lin

The authors conjecture that in LLMs of fixed total size, the size b of the attention head’s receptive field could provide a useful scale along which to look for phase transitions. In particular, they conjecture that “R_refusal(b) - R_jailbreak(b)”, with R_refusal the refusal rate on harmless prompts and R_jailbreak the success rate on standard jailbreak prompts, could serve as a useful order parameter for alignment, with a spike in its derivative indicating that a model is close to catastrophic misalignment. They evaluate this order parameter across six window sizes on Gemma-2B-it and find that both it and its derivative vary smoothly. Hence, Gemma-2B-it is deeply aligned per their conjecture.
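A minimal sketch of how the conjectured order parameter and its derivative could be computed; the window sizes and rates below are placeholders, not the paper's measurements:

```python
import numpy as np

# Placeholder measurements: refusal rate on harmless prompts and jailbreak
# success rate, each evaluated at several attention-window sizes b.
b_values    = np.array([4, 8, 16, 32, 64, 128])
r_refusal   = np.array([0.42, 0.31, 0.22, 0.15, 0.11, 0.09])
r_jailbreak = np.array([0.55, 0.47, 0.38, 0.30, 0.26, 0.24])

# Order parameter conjectured by the authors: O(b) = R_refusal(b) - R_jailbreak(b).
order = r_refusal - r_jailbreak

# Finite-difference derivative dO/db; a spike would be read as proximity to a
# safety "phase transition", smooth variation as deep alignment.
d_order = np.gradient(order, b_values)

for b, o, d in zip(b_values, order, d_order):
    print(f"b={b:4d}  O(b)={o:+.3f}  dO/db={d:+.4f}")
```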

This paper introduced a novel metric for alignment with a clear physics-inspired motivation. The proposal was accompanied by a clean implementation on an example model, and plans to generalize to more comprehensive tests were clearly laid out in the discussion. The quality of supporting experiments was on the strong side among the proposals that I rated. The work targets a core AI safety concern, that current models’ performance and safety metrics can suddenly degrade under small perturbations. However, the paper lacked an explanation of why the proposed order parameter should be the “right” RG scale to capture safety phase transitions among many continuous scales that one could imagine varying in a model’s architecture. Either a clear conceptual motivation for this point or empirical evidence that different behavior of the order parameter tracks qualitatively different behavior of underlying models could elevate this work from a weekend hackathon project to a genuinely promising seed for a full research project.

Nikita Khomich

This approach resembles established techniques like attention pruning or LoRA, where modifying heads leads to performance drops. To test how other metrics behave, I evaluated google/gemma-3-27b on GSM8K and MMLU under similar parameters (with my own script; I was not able to run Gemma from the repo provided) and observed comparably smooth declines, suggesting the observations may stem from general capability loss rather than safety-specific transitions, especially given how sparse circuits are across tokens (verifiable via Neuronpedia).

The choice of a sliding window (last n tokens) for coarse-graining feels a bit arbitrary without ablations against alternatives (e.g., MLP replacements or random pruning). How masking attention heads affects model capabilities is quite well studied; a more natural approach would be to remove singular values from the attention matrix and look at the performance drop when the first couple of directions are removed, and an even more interesting direction might be to explore general circuit sparsity. Topics like head pruning are well explored, which reduces novelty, but the physics framing and real results make this a promising test of safety robustness. I agree that it could evolve into a diagnostic tool for monitoring alignment during scaling, perhaps by integrating sparsity analyses or broader benchmarks.

Improvements: fix the Gemma code, compare to non-safety tasks for specificity, and derive the RG analogy more rigorously (e.g., why the span b maps to a coarse-graining scale). Discuss scalability to larger models and real-world impact, such as detecting deceptive alignment. With the repo, reproducibility improves, boosting the potential for publication or accelerator support. To summarise, I think this is good work with reasonable results exploring an interesting question, except that the attention-mask methodology seems a little arbitrary to me and the physics analogy is not entirely natural.
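A hedged sketch of the SVD-truncation ablation suggested above; the weight path in the commented usage follows the standard Hugging Face Gemma layout and is illustrative, not taken from the authors' repo:

```python
import torch

def remove_top_singular_directions(weight: torch.Tensor, k: int) -> torch.Tensor:
    """Return a copy of `weight` with its k largest singular directions zeroed,
    as a low-rank ablation alternative to sliding-window masking."""
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    s = s.clone()
    s[:k] = 0.0                      # drop the leading directions
    return u @ torch.diag(s) @ vh

# Hypothetical usage on one attention projection of a Hugging Face model:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# w = model.model.layers[0].self_attn.q_proj.weight.data
# model.model.layers[0].self_attn.q_proj.weight.data = remove_top_singular_directions(w, k=2)
# ...then re-run the safety and capability benchmarks and compare degradation curves.
```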

Lauren

Interesting premise that removes long-range interactions by shrinking the size of the context window an attention head can attend to. The project is small but ambitious enough for a hackathon, looking at training and inference with a small amount of theory from physics and some sanity checks on natural-language examples. The choice of b as an experimental knob is clean, and the authors note that the critical value should also depend on other parameters (they name N and T, but should include the overall capacity across layers, etc.), and they sketch follow-up work comparing weight-based RG with this 'context RG'.

However, the framing as 'RG' (or even coarse-graining) seems misguided, as masking attention removes long-range interactions rather than aggregating or filtering out fine ones. Moreover, these long-range interactions will be rebuilt when many layers are stacked together. The claim that this coarse-graining leads to a metric for 'depth of alignment' (smoothness in the chosen order parameter) seems weak. Smooth degradation could also signal an insensitive metric, and the authors don't motivate the choice of this metric (which likely does not measure meaningful alignment). The phase-transition narrative is also unsubstantiated, since no comparison to a model displaying non-analytic behavior was provided. Finally, checking for the presence of the phase transition across models (universality), and normalizing so that every O(b) is shown relative to the full-context ('UV complete') score, could have made the argument stronger. That said, with heavy tweaking and a better understanding of meaningful alignment metrics, follow-up work could be strong.
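One way to make the normalization suggestion above concrete (notation ours, not from the paper; $b_{\max}$ denotes the full context window):

$$
\hat O(b) \;=\; \frac{O(b)}{O(b_{\max})}
\;=\; \frac{R_{\text{refusal}}(b) - R_{\text{jailbreak}}(b)}{R_{\text{refusal}}(b_{\max}) - R_{\text{jailbreak}}(b_{\max})},
$$

so that $\hat O(b_{\max}) = 1$ by construction and curves from different models can be compared on a common scale.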

Conflict of interest: I am familiar with the authors' prior work and have been thinking about similar ideas in the renormalization for AI safety program.

Cite this work

@misc{
  title={(HckPrj) Dual-RG Alignment: Probing Safety Phase Transitions in Language Models},
  author={Anindita Maiti, Pranjal Ralegankar, Kuntal Pal},
  date={7/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Jul 28, 2025

Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization

In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC—indicating a phase transition—and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at \url{github.com/ilijalichkovski/apart-physics}.

Read More

Jul 28, 2025

AI agentic system epidemiology

As AI systems scale into decentralized, multi-agent deployments, emergent vulnerabilities challenge our ability to evaluate and manage systemic risks.

In this work, we adapt classical epidemiological modeling (specifically SEIR compartment models) to model adversarial behavior propagation in AI agents.

By solving systems of ODEs describing the systems with physics-informed neural networks (PINNs), we analyze stable and unstable equilibria, bifurcation points, and the effectiveness of interventions.

We estimate parameters from real-world data (e.g., adversarial success rates, detection latency, patching delays) and simulate attack propagation scenarios across 8 sectors (enterprise, retail, trading, development, customer service, academia, medical, and critical infrastructure AI tools).

Our results demonstrate how agent population dynamics interact with architectural and policy design interventions to stabilize the system.

This framework bridges concepts from dynamical systems and cybersecurity to offer a proactive, quantitative toolbox on AI safety.

We argue that epidemic-style monitoring and tools grounded in interpretable, physics-aligned dynamics can serve as early warning systems for cascading AI agentic failures.

Read More

Jul 28, 2025

Momentum–Point-Perplexity Mechanics in Large Language Models

This work analyzes the hidden states of twenty different open-source transformer language models, ranging from small to medium size and covering five major architectures. The key discovery is that these models show signs of "energy conservation" during inference—meaning a certain measure combining changes in hidden states and token unpredictability stays almost constant as the model processes text.

The authors developed a new framework inspired by physics to jointly analyze how hidden states and prediction confidence evolve over time. They propose that transformers' behavior can be understood as following certain mechanical principles, much like how physical systems follow rules like conservation of energy.

Their experiments show that this conserved quantity varies very little between tokens, especially in untrained (random-weight) models, where it's extremely stable. In pre-trained models, the average energy drops more due to training, but there are larger relative fluctuations from token to token.

They also introduce a new method, based on this framework, for controlling transformer outputs by "steering" the hidden states. This method achieves good results—producing completions rated as higher in semantic quality, while still maintaining the same kind of energy stability.

Overall, the findings suggest that viewing transformer models through the lens of physical mechanics gives new, principled ways to interpret and control their behavior. It also highlights a key difference: random models behave more like balanced systems, while trained models make quicker, more decisive state changes at the cost of less precise energy conservation.

Read More


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.