Jul 28, 2025
Dual-RG Alignment: Probing Safety Phase Transitions in Language Models
Anindita Maiti, Pranjal Ralegankar, Kuntal Pal
Modern language models (LMs) often appear well-behaved until a tiny change, such as a clever jailbreak prompt, a lightweight fine-tune, or a higher sampling temperature, suddenly causes them to ignore safety policies. In physics, such "abrupt flips" are classic signs of a phase transition, often described in terms of renormalization group (RG) flow. We introduce a virtual RG step, known as coarse-graining, on attention heads to predict the proximity of such tipping points for state-of-the-art LM architectures, without requiring massive training hours. The coarse-graining method systematically limits the amount of input detail the attention heads are exposed to, thereby gradually degrading safety metrics and inference abilities. We show that these coarse-graining effects, independent of and sometimes competing with training-induced coarse-graining or pruning of model weights, induce smooth degradations in the safety metrics of Gemma-2b-it, an indicator of deep-rooted AI alignment.
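To make the idea concrete, below is a minimal sketch of what a block-spin-style coarse-graining step on an attention head could look like. It is not the authors' implementation; the function names (coarse_grain, coarse_attention) and the choice of average-pooling keys and values over token blocks are illustrative assumptions, chosen because block averaging is the canonical RG decimation step.

```python
# Illustrative sketch only: a "virtual RG step" that limits the input detail
# an attention head sees by block-averaging its keys and values.
# All names and design choices here are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def coarse_grain(x: torch.Tensor, block_size: int) -> torch.Tensor:
    """Average-pool the sequence dimension in blocks of `block_size`,
    discarding fine-grained token detail (an RG-like decimation)."""
    b, t, d = x.shape  # (batch, seq_len, d_model)
    pad = (-t) % block_size
    if pad:  # pad the sequence so it divides evenly into blocks
        x = F.pad(x, (0, 0, 0, pad))
    return x.view(b, -1, block_size, d).mean(dim=2)

def coarse_attention(q, k, v, block_size: int) -> torch.Tensor:
    """Single-head attention whose keys/values are coarse-grained,
    so the head can only attend to block-averaged context."""
    k_c = coarse_grain(k, block_size)
    v_c = coarse_grain(v, block_size)
    scores = q @ k_c.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v_c

if __name__ == "__main__":
    q = torch.randn(1, 16, 64)
    k = torch.randn(1, 16, 64)
    v = torch.randn(1, 16, 64)
    # Increasing block_size is one coarse-graining step: the head sees
    # progressively less input detail. Safety metrics would be re-measured
    # at each step to track how smoothly they degrade.
    for block_size in (1, 2, 4, 8):
        print(block_size, coarse_attention(q, k, v, block_size).shape)
```

Under this reading, sweeping block_size plays the role of the RG scale parameter: a smooth decline of safety metrics along the sweep would signal robust alignment, while a sharp drop at some block size would flag a nearby tipping point.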
Cite this work
@misc{maiti2025dualrg,
  title={Dual-RG Alignment: Probing Safety Phase Transitions in Language Models},
  author={Anindita Maiti and Pranjal Ralegankar and Kuntal Pal},
  date={2025-07-28},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}