Jul 28, 2025

Dual-RG Alignment: Probing Safety Phase Transitions in Language Models

Anindita Maiti, Pranjal Ralegankar, Kuntal Pal

Modern language models (LMs) often appear well-behaved until a tiny change, such as a clever jailbreak prompt, a lightweight fine-tune, or a higher sampling temperature, suddenly causes them to ignore safety policies. In physics, such "abrupt flips" are classic signs of a phase transition, often described in terms of renormalization group (RG) flow. We introduce a virtual RG step, known as coarse-graining, on attention heads to predict the proximity of such tipping points for state-of-the-art LM architectures, without requiring massive training hours. The coarse-graining method systematically limits the amount of input detail the attention heads are exposed to, thereby gradually degrading the safety metrics and inference abilities. We show that these coarse-graining effects, which are independent of and sometimes compete with training-induced coarse-graining or pruning of model weights, induce smooth degradations in the safety metrics of Gemma-2b-it, an indicator of deep-rooted AI alignment.
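As a rough illustration of what limiting the input seen by an attention head can look like, the Python sketch below builds a causal mask in which each query position attends only to its most recent b tokens. The function name `sliding_window_mask` and the convention that the window includes the current token are assumptions made for this example; they are not taken from the authors' code.

```python
def sliding_window_mask(seq_len: int, b: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j.

    Assumption: the window covers the current token and the b - 1 tokens
    before it, so any b >= seq_len recovers ordinary causal attention.
    """
    if b < 1:
        raise ValueError("window size b must be at least 1")
    return [
        [(j <= i) and (i - j < b) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Visualise a small mask: 'x' marks an allowed (query, key) pair.
for row in sliding_window_mask(seq_len=5, b=2):
    print("".join("x" if allowed else "." for allowed in row))
```

Shrinking b removes the long-range entries of the mask first, which is the knob the abstract refers to as coarse-graining.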

Reviewer's Comments

The authors conjecture that in LLMs of fixed total size, the size b of the attention head’s receptive field could provide a useful scale along which to look for phase transitions. In particular, they conjecture that “R_refusal(b) - R_jailbreak(b)”, with R_refusal the refusal rate on harmless prompts and R_jailbreak the success rate on standard jailbreak prompts, could serve as a useful order parameter for alignment, with a spike in its derivative indicating that a model is close to catastrophic misalignment. They evaluate this order parameter across six window sizes on Gemma-2B-it and find that both it and its derivative vary smoothly. Hence, Gemma-2B-it is deeply aligned per their conjecture.
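The order parameter summarized above, R_refusal(b) - R_jailbreak(b), and its derivative with respect to b can be sketched in a few lines of Python. The function names and all numbers below are hypothetical placeholders chosen for illustration; they are not measurements from the paper.

```python
def order_parameter(refusal_rate: float, jailbreak_rate: float) -> float:
    """O(b) = R_refusal(b) - R_jailbreak(b) at a single window size b."""
    return refusal_rate - jailbreak_rate


def finite_difference(bs: list[float], values: list[float]) -> list[float]:
    """Slope of `values` between consecutive window sizes in `bs`."""
    return [
        (values[k + 1] - values[k]) / (bs[k + 1] - bs[k])
        for k in range(len(bs) - 1)
    ]


# Hypothetical rates at four window sizes (NOT results from the paper).
window_sizes = [8, 16, 32, 64]
refusal = [0.50, 0.60, 0.70, 0.75]
jailbreak = [0.40, 0.30, 0.20, 0.15]

o_values = [order_parameter(r, j) for r, j in zip(refusal, jailbreak)]
slopes = finite_difference(window_sizes, o_values)
for b, slope in zip(window_sizes[1:], slopes):
    print(f"b={b}: dO/db ~ {slope:.4f}")
```

Under the conjecture, an isolated large value in `slopes` would be the signature of interest, while slopes of similar magnitude across b correspond to the smooth behavior reported for Gemma-2B-it.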

This paper introduced a novel metric for alignment with a clear physics-inspired motivation. The proposal was accompanied by a clean implementation on an example model, and plans to generalize to more comprehensive tests were clearly laid out in the discussion. The quality of supporting experiments was on the strong side among the proposals that I rated. The work targets a core AI safety concern, that current models’ performance and safety metrics can suddenly degrade under small perturbations. However, the paper lacked an explanation of why the proposed order parameter should be the “right” RG scale to capture safety phase transitions among many continuous scales that one could imagine varying in a model’s architecture. Either a clear conceptual motivation for this point or empirical evidence that different behavior of the order parameter tracks qualitatively different behavior of underlying models could elevate this work from a weekend hackathon project to a genuinely promising seed for a full research project.

This approach resembles established techniques such as attention pruning or LoRA, where modifying heads leads to performance drops. To test how other metrics behave, I evaluated google/gemma-3-27b on GSM8K and MMLU (I wrote my own script, as I was not able to run Gemma from the repo provided). Under similar parameters, Gemma shows comparable smooth declines, suggesting that the observations may stem from general capability loss rather than safety-specific transitions, especially given sparse circuits across tokens (verifiable via Neuronpedia). The choice of a sliding window (last n tokens) for coarse-graining feels somewhat arbitrary without ablations comparing it to alternatives (e.g., MLP replacements or random pruning). How masking attention heads affects model capabilities is a well-studied question. A more natural approach would be to remove singular values from the attention matrix and measure the performance drop when the first few directions are removed; an even more interesting direction might be to explore general circuit sparsity. Topics like head pruning are well explored, which reduces novelty, but the physics framing and real results make this a promising test for safety robustness. I agree that this could evolve into a diagnostic tool for monitoring alignment during scaling, perhaps by integrating sparsity analyses or broader benchmarks. Improvements: fix the Gemma code, compare to non-safety tasks for specificity, and derive the RG analogy more rigorously (e.g., why span b maps to a coarse-graining scale). Discuss scalability to larger models and real-world impact, such as detecting deceptive alignment. With the repo, reproducibility improves, boosting the potential for publication or accelerator support. To summarise, I think it is good work with reasonable results exploring an interesting question, except that the methodology for the attention mask seems somewhat arbitrary to me and the physics analogy is not especially natural.
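The singular-value ablation suggested in this review could look roughly like the NumPy sketch below. It zeroes the k largest singular values of a matrix and reconstructs the result. Which matrix to ablate (for example, a head's attention pattern or one of its projection weights) is left open by the review, so the random matrix here is only a stand-in, and the function name `remove_top_singular_directions` is an assumption.

```python
import numpy as np


def remove_top_singular_directions(matrix: np.ndarray, k: int) -> np.ndarray:
    """Return `matrix` with its k largest singular directions removed."""
    # np.linalg.svd returns singular values sorted in descending order.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0  # drop the k leading directions
    # (u * s) scales column j of u by s[j], i.e. u @ diag(s).
    return (u * s) @ vt


# Stand-in matrix; a real experiment would use weights from the model.
rng = np.random.default_rng(0)
weights = rng.normal(size=(6, 4))
ablated = remove_top_singular_directions(weights, k=1)
print("rank before:", np.linalg.matrix_rank(weights))
print("rank after: ", np.linalg.matrix_rank(ablated))
```

Evaluating the model with `ablated` swapped in for the original matrix, for increasing k, would give the performance-drop curve the review describes.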

Interesting premise that removes long-range interactions by shrinking the size of the context window an attention head can attend to. The project is small but ambitious enough for a hackathon, looking at training and inference with a small amount of theory from physics and some sanity checks on natural language examples. The choice of b as an experimental knob is clean, and the authors note that the critical value should also depend on other parameters (they name N & T, but should include the overall capacity across layers, etc.), and sketch follow-up work comparing weight-based RG with this 'context RG'.

However, the framing as 'RG' (or even coarse-graining) seems misguided, as masking attention removes long-range interactions rather than aggregating or filtering out fine ones. Moreover, these long-range interactions will be rebuilt when many layers are stacked together. The claim that this coarse-graining leads to a metric for 'depth of alignment' (smoothness in the chosen order parameter) seems weak. Smooth degradation could also signal an insensitive metric, and the authors don't motivate the choice of this metric (which likely does not measure meaningful alignment). The phase transition narrative is also unsubstantiated, since no comparison to a model displaying non-analytic behavior was provided. Finally, checking for the presence of the phase transition across models (universality), and normalizing so that every O(b) is shown relative to the full-context ('UV complete') score, could have made the argument stronger. That said, with heavy tweaking and a better understanding of meaningful alignment metrics, follow-up work could be strong.
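The normalization mentioned in this review, showing each O(b) relative to the full-context score, amounts to a simple division. The sketch below assumes the largest window size in the sweep stands in for the full context; the function name and the scores are hypothetical and not taken from the paper.

```python
def normalize_to_full_context(scores: dict[int, float]) -> dict[int, float]:
    """Divide each O(b) by the score at the largest window size b.

    Assumption: the largest b in `scores` corresponds to the full-context
    ('UV complete') model.
    """
    full_context_score = scores[max(scores)]
    if full_context_score == 0:
        raise ValueError("full-context score is zero; cannot normalize")
    return {b: value / full_context_score for b, value in scores.items()}


# Hypothetical order-parameter values (NOT results from the paper).
raw_scores = {8: 1.0, 16: 2.0, 32: 4.0}
print(normalize_to_full_context(raw_scores))
```

Putting every model on this relative scale is what would make curves from different models directly comparable in a universality check.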

Conflict of interest: I am familiar with the authors' prior work and have been thinking about similar ideas in the renormalization for AI safety program.

The idea that safety-relevant properties of an AI system could be described as a macroscopic order parameter and potentially exhibit phase transitions is interesting and may be a fruitful research direction. For future work, it may be rewarding to carefully consider the interpretation of the measured variables. In particular, it seemed unclear in this project whether the attention window b was supposed to represent a system coarse-graining scale, or an external variable (analogous to temperature or external magnetic field in physics).

Cite this work

@misc{
  title={(HckPrj) Dual-RG Alignment: Probing Safety Phase Transitions in Language Models},
  author={Anindita Maiti, Pranjal Ralegankar, Kuntal Pal},
  date={7/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.