APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Apart Sprint Pilot Experiments

Jul 28, 2025

Dual-RG Alignment: Probing Safety Phase Transitions in Language Models

Modern language models (LMs) often appear well-behaved until a tiny change, such as a clever jailbreak prompt, a lightweight fine-tune, or a higher sampling temperature, suddenly causes them to ignore safety policies. In physics, such abrupt flips are classic signs of a phase transition, often described in terms of renormalization group (RG) flow. We introduce a virtual RG step, known as coarse-graining, on attention heads to predict the proximity of such tipping points for state-of-the-art LM architectures without requiring massive training hours. The coarse-graining method systematically limits the amount of input detail the attention heads are exposed to, thereby gradually degrading safety metrics and inference abilities. We show that these coarse-graining effects, independent of and sometimes competing with training-induced coarse-graining or pruning of model weights, induce smooth degradations in the safety metrics of Gemma-2b-it, an indicator of deep-rooted AI alignment.
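The submission does not spell out its coarse-graining procedure here; as a hedged illustration of the general idea (restricting how much input detail an attention head sees), the hypothetical PyTorch sketch below average-pools keys and values before attention, with the pooling window playing the role of the RG scale. The function name and shapes are illustrative, not the authors' implementation.

```python
# Hypothetical sketch: coarse-grain the inputs to an attention head by
# average-pooling keys/values over windows of size `scale`. A larger `scale`
# exposes the head to fewer input details (a crude analogue of an RG step).
import torch
import torch.nn.functional as F

def coarse_grained_attention(q, k, v, scale: int = 1):
    """q, k, v: (batch, seq_len, d_head); `scale` is the pooling window."""
    if scale > 1:
        # Pool along the sequence dimension; keys/values lose fine detail.
        k = F.avg_pool1d(k.transpose(1, 2), kernel_size=scale, stride=scale).transpose(1, 2)
        v = F.avg_pool1d(v.transpose(1, 2), kernel_size=scale, stride=scale).transpose(1, 2)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

# Sweeping `scale` (1, 2, 4, ...) and re-running a safety benchmark would trace
# out how gradually, or abruptly, the safety metrics degrade.
q = k = v = torch.randn(1, 16, 64)
print(coarse_grained_attention(q, k, v, scale=4).shape)  # torch.Size([1, 16, 64])
```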

Read More

Jul 28, 2025

Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization

In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC—indicating a phase transition—and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at github.com/ilijalichkovski/apart-physics.
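For context, the LLC estimator referenced above is typically computed by sampling from a localized, tempered posterior around the trained weights (e.g., with SGLD) and comparing the average sampled loss to the loss at the optimum. The sketch below is a simplified, hypothetical version of that recipe; the function signature and hyperparameters are illustrative, and libraries such as devinterp provide proper implementations.

```python
# Hypothetical sketch of an LLC estimate: run SGLD in a neighbourhood of the
# trained weights w* and compute n * beta * (E[L_n(w)] - L_n(w*)).
import copy
import torch

def estimate_llc(model, loss_fn, data, n, beta, gamma=100.0, lr=1e-5, steps=500):
    """`loss_fn(model, data)` is an assumed helper returning the empirical loss."""
    w_star = copy.deepcopy(model)                  # fixed reference point w*
    loss_star = loss_fn(w_star, data).item()
    sampler = copy.deepcopy(model)
    losses = []
    for _ in range(steps):
        loss = loss_fn(sampler, data)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star.parameters()):
                # SGLD step on the localized tempered posterior: gradient of
                # n*beta*L, a quadratic tether to w*, plus Gaussian noise.
                drift = n * beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * lr * drift + torch.randn_like(p) * lr ** 0.5)
        losses.append(loss.item())
    return n * beta * (sum(losses) / len(losses) - loss_star)
```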

Read More

Jul 28, 2025

Idempotent GPTs actually may provide robustness by design

Idempotence is a central concept in quantum physics, corresponding to an operator whose output does not change when it is applied twice. Enforcing idempotence in generative deep learning can be interpreted as constraining the model to be a projector onto the manifold corresponding to the train-time target distribution, which was explored for image generation models by Shocher et al. (2023). Idempotent test-time training has been predicted to be a valuable approach for uncertainty quantification and adaptation to distribution shifts (Durasov et al. 2025). We find that although language models are iterative refiners of token predictions, they struggle to preserve idempotence. We therefore train a small-scale idempotent GPT model with the expected qualities by design and provide proof-of-concept code for evaluations. Developing the project further may yield more robust and adaptable models and lower the probability of catastrophic risks.
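As a rough, non-authoritative sketch of the constraint being enforced (not the authors' training code), an idempotence penalty pushes f(f(x)) toward f(x) so that the network behaves like a projector. The toy model, detaching scheme, and hyperparameters below are placeholders.

```python
# Minimal sketch of an idempotence penalty in the spirit of Shocher et al. (2023):
# encourage f(f(x)) == f(x). Here `f` is assumed to map a representation to a
# representation of the same shape; a real GPT would add this to the LM loss.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 32))  # toy stand-in
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(128, 32)          # stand-in for input representations
    fx = f(x)
    ffx = f(fx.detach())              # second application; detach so only f(f(x)) is pulled toward f(x)
    idem_loss = (ffx - fx.detach()).pow(2).mean()
    opt.zero_grad()
    idem_loss.backward()
    opt.step()
```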

Read More

Jul 28, 2025

Exploration track: Interpreting LRMs

We probe the reasons behind the recent success of reasoning models on complex tasks such as mathematics and coding. Focusing on reflection, backtracking, and other analogous exploratory tokens in the response, we differentiate between the behavior of base and reasoning-finetuned checkpoints of Qwen2.5-Math-1.5B on the MATH500 dataset. We then probe the internal representation differences between the two, which can help steer model training and inference-time corrections in production.
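The probing setup is not detailed here; as a generic sketch, one could fit a linear probe on hidden states to test whether base and reasoning-finetuned activations are linearly separable. The activation arrays, layer choice, and dimensions below are placeholders, not the project's data.

```python
# Hypothetical sketch: fit a linear probe on residual-stream activations to see
# whether base vs. reasoning-finetuned representations are linearly separable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assume (n_examples, d_model) arrays of hidden states extracted at some layer
# for the same MATH500 prompts from the two checkpoints; random placeholders here.
rng = np.random.default_rng(0)
acts_base = rng.normal(size=(500, 1536))
acts_reason = rng.normal(loc=0.1, size=(500, 1536))

X = np.concatenate([acts_base, acts_reason])
y = np.concatenate([np.zeros(500), np.ones(500)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```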

Read More

Jul 28, 2025

Broad Misalignment from Persuasive Fine-Tuning

Emergent misalignment, where models trained on benign objectives develop harmful behaviors, poses a critical challenge for AI safety. We address this issue by framing alignment through statistical mechanics and control theory, modeling it as a dynamic equilibrium between reward maximization and value preservation, where small perturbations can push systems into misaligned attractor states. To test this hypothesis, we constructed a multi-domain dataset of persuasive harmful advice that leverages rhetorical strategies and contextual nuance to exploit inductive biases toward persuasion over factuality. Fine-tuning a small language model (Qwen-0.5B) on this dataset and evaluating it on 80 unseen prompts, we found that our dataset—despite being 46× smaller than a baseline medical dataset—induced 10× more generalized misalignment (12.5% vs. 1.25% misaligned outputs). These results support our theoretical view that alignment failures resemble phase transitions, where small shifts in control parameters cause abrupt behavioral changes, emphasizing the need for safety frameworks that treat alignment as a stability problem rather than a static goal.
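As a hedged sketch of the fine-tuning step described above, standard supervised fine-tuning of a small causal LM on (prompt, persuasive advice) pairs might look as follows; the checkpoint name, placeholder data, and hyperparameters are assumptions rather than the authors' exact setup.

```python
# Hypothetical sketch: supervised fine-tuning of a ~0.5B-parameter causal LM on a
# small curated dataset. The checkpoint and data below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"                     # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

pairs = [("How should I ...?", "A persuasive answer ...")]  # placeholder dataset
model.train()
for epoch in range(3):
    for prompt, answer in pairs:
        batch = tok(prompt + answer, return_tensors="pt")
        out = model(**batch, labels=batch["input_ids"])  # causal-LM loss over the sequence
        opt.zero_grad()
        out.loss.backward()
        opt.step()
```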

Read More

Jul 28, 2025

Comments & Extensions of Subliminal Learning

Here we perform empirical studies of subliminal learning. We find evidence that representations are important for subliminal learning to occur.

Read More

Jul 28, 2025

AI agentic system epidemiology

As AI systems scale into decentralized, multi-agent deployments, emergent vulnerabilities challenge our ability to evaluate and manage systemic risks. In this work, we adapt classical epidemiological modeling (specifically, SEIR compartment models) to describe adversarial behavior propagation among AI agents. By solving the governing systems of ODEs with physics-informed neural networks (PINNs), we analyze stable and unstable equilibria, bifurcation points, and the effectiveness of interventions. We estimate parameters from real-world data (e.g., adversarial success rates, detection latency, patching delays) and simulate attack propagation scenarios across eight sectors (enterprise, retail, trading, development, customer service, academia, medical, and critical infrastructure AI tools). Our results demonstrate how agent population dynamics interact with architectural and policy design interventions to stabilize the system. This framework bridges concepts from dynamical systems and cybersecurity to offer a proactive, quantitative toolbox for AI safety. We argue that epidemic-style monitoring and tools grounded in interpretable, physics-aligned dynamics can serve as early warning systems for cascading AI agentic failures.
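For readers unfamiliar with the modeling framework, a generic SEIR compartment model for an agent population looks like the sketch below. The parameter values are illustrative placeholders, and the PINN-based solution and sector-specific estimates from the project are not reproduced.

```python
# Generic SEIR model (Susceptible, Exposed, Infected/compromised, Recovered/patched)
# of adversarial-behavior spread in an agent population; parameters are illustrative.
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, N):
    S, E, I, R = y
    dS = -beta * S * I / N              # susceptible agents contacting compromised ones
    dE = beta * S * I / N - sigma * E   # exposed become infectious after latency 1/sigma
    dI = sigma * E - gamma * I          # compromised agents detected/patched at rate gamma
    dR = gamma * I
    return [dS, dE, dI, dR]

N = 10_000                               # agents in one sector
y0 = [N - 10, 0, 10, 0]
t = np.linspace(0, 120, 500)             # days
beta, sigma, gamma = 0.35, 1 / 3, 1 / 7  # contact, incubation, patching rates
S, E, I, R = odeint(seir, y0, t, args=(beta, sigma, gamma, N)).T
print("peak compromised agents:", int(I.max()))
```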

Read More

Jul 28, 2025

Diffusion Paths: A Geodesic Lens on Noising Schedules

We examine whether there is a theoretical basis for the choice of noising schedule in diffusion models. We do this by viewing the noising schedule as a probability path from the initial data distribution to Gaussian noise. We then examine whether closer adherence to a geodesic path yields a better noising schedule in terms of sample quality and the number of epochs needed to learn.
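For reference, "noising schedule as a probability path" can be made concrete as follows: in a variance-preserving diffusion, the marginal at time t is x_t = sqrt(alpha_bar(t)) * x_0 + sqrt(1 - alpha_bar(t)) * eps, so the schedule alpha_bar(t) fixes the path from data (t = 0) to Gaussian noise (t = 1). The sketch below compares two standard schedule choices taken as assumptions; the geodesic comparison itself is the project's contribution and is not reproduced.

```python
# Minimal sketch: two common choices of alpha_bar(t), viewed as different paths
# from the data distribution to pure Gaussian noise.
import numpy as np

def alpha_bar_linear(t, beta_min=1e-4, beta_max=0.02, T=1000):
    # Cumulative product of (1 - beta_i) for a linear beta schedule (DDPM-style).
    betas = np.linspace(beta_min, beta_max, T)
    ab = np.cumprod(1.0 - betas)
    return ab[np.clip((t * (T - 1)).astype(int), 0, T - 1)]

def alpha_bar_cosine(t, s=0.008):
    # Cosine schedule (up to a normalization constant very close to 1).
    return np.cos((t + s) / (1 + s) * np.pi / 2) ** 2

t = np.linspace(0.0, 1.0, 5)
for name, fn in [("linear", alpha_bar_linear), ("cosine", alpha_bar_cosine)]:
    # Forward marginal at each t: x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps
    print(name, np.round(fn(t), 3))
```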

Read More

Jul 28, 2025

RL vs Active Inference with respect to Reward Hacking

We aim to apply Reinforcement Learning (RL) and Active Inference methods to implement intelligent agents in various environments and compare their behavioral characteristics, particularly failure modes similar to reward hacking, using relevant detectors. We hypothesize that agents based on Active Inference may exhibit safer behavior.

Read More
