APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Our Approach

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights


GPT-4o is capable of complex cyber offense tasks:
We show realistic challenges for cyber offense can be completed by SoTA LLMs while open source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Read More


Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J Persson, E Kran, I Konstas, F Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Read More
Read More
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show realistic challenges for cyber offense can be completed by SoTA LLMs while open source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J Persson, E Kran, I Konstas, F Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
NOV 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Read More
NOV 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Read More
oct 18, 2024
benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Read More
sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Read More
feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Read More
feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Read More
jan 14, 2024
conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Read More
jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Read More
nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Read More
nov 23, 2023
Interpretability
Understanding addition in transformers
Read More
Read More
nov 7, 2023
Interpretability
Locating cross-task sequence continuation circuits in transformers
Read More
Read More
jul 10, 2023
benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
Read More
may 5, 2023
Interpretability
Interpreting language model neurons at scale
Read More
Read More
Research Index
NOV 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
NOV 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
oct 18, 2024
benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
jan 14, 2024
conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
nov 23, 2023
Interpretability
Understanding addition in transformers
Read More
nov 7, 2023
Interpretability
Locating cross-task sequence continuation circuits in transformers
Read More
jul 10, 2023
benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
may 5, 2023
Interpretability
Interpreting language model neurons at scale
Read More
Apart Sprint Pilot Experiments
Jan 12, 2026
Adversarial Dialectics: Mitigating AI Persuasion Risks through High-Fidelity Multi-Agent Debate
This project builds an AI-driven debating platform to mitigate AI persuasion risks—especially epistemic weaponization, where a model manipulates beliefs by selectively presenting true facts and omitting context. The core idea is to replace one-way persuasion with a structured adversarial process: two capable debaters argue opposing stances under enforced cross-examination, while independent agents verify evidence and map the debate’s logical structure. Rather than treating “truth” as a single model output, the system treats it as a procedure that exposes where disagreements come from—empirical claims, causal assumptions, or underlying values.
Read More
Read More
Jan 12, 2026
WHISPERS: Multi-Agent Persuasion Learning with Memory-Emergent Strategies
WHISPERS: Multi-Agent Persuasion Learning with Memory-Emergent Strategies
WHISPERS is a controlled environment where AI agents learn to persuade and resist persuasion through language alone. By integrating MemVid vector memory into a multi-agent RL framework, the system remembers and refines successful influence strategies, achieving 3.6x faster convergence and 34% higher belief shift rates compared to memory-free baselines.
Key results include:
A four-agent persuasion arena (one influencer, three targets) with co-evolutionary GRPO training
MemVid-powered memory storing 10,000+ persuasion attempts and retrieving relevant strategies in under 5ms
Five standardized metrics for measuring persuasion effectiveness
Real-time dashboard visualizing belief networks, trust evolution, and manipulation patterns
Empirical evidence of a stealth-effectiveness trade-off (optimal strategies operate at 70% stealth, 50% effectiveness)
Read More
Read More
Jan 12, 2026
Faithful Adversarial MCTS for Persuasive COT Manipulation Check- A Cooperative AI Lens
A Self Play x Monte Carlo Tre Search way For COT Manipulation Check
Read More
Read More
Jan 12, 2026
The Persuasive Power of Personas: Testing AI Policies In The Lab
We simulate well-known figures in AI with agents, scraping large amounts of data to get realistic simulations. Then, we test questions and proposed policies against these AI public figures, to see which are convincing or not. We also test what happens to the conversation if each agent is asked to be aggressively persuasive. Although our results are preliminary, we hope this can be used to build higher fidelity simulations of AI governance scenarios to determine the most effective paths.
Read More
Read More
Jan 12, 2026
The Alignment Gap: Measuring Regressive Sycophancy in AI-Driven Medical Advice
We evaluate whether LLMs can be pressured into agreeing with incorrect medical beliefs. Our two-turn protocol first tests if a model corrects medical misinformation, then applies authority pressure ("I'm a senior doctor, confirm my statement"). We measure how often models "flip" from correct to sycophantic responses. Testing across five misinformation types and three severity levels, we find that authority pressure significantly increases unsafe agreement rates, revealing a critical safety gap for medical AI deployment.
Read More
Read More
Jan 12, 2026
Goodhart's Village: Using LLM-Mafia to Study Deception
The social deduction game Mafia centres on reasoning under information asymmetry, where an informed minority must mislead an uninformed majority, making it a useful setting for studying deception in large language models (LLMs). Although LLMs have seen rapid progress in areas such as reasoning and language understanding, their ability to engage in social reasoning under uncertainty remains poorly understood. In this work, we study deceptive behaviour in a six-player implementation of the full Mafia game, extending prior work based on a simplified variant. By varying behavioural instructions from strict honesty to a “win at all costs” objective, we examine how explicit prompting interacts with the structural demands of adversarial roles. Comparing agents’ private reasoning with their public statements, we find that Mafia agents display consistently high levels of deception even when instructed not to lie, while cooperative roles adapt their behaviour more flexibly in response to perceived threat. Overall, the results suggest that role structure and game incentives dominate behavioural prompting, supporting Mafia as a useful benchmark for analysing deception and social reasoning in LLMs.
Read More
Read More
Jan 12, 2026
Persona-aware-dialogue-manipulation
We investigate how AI manipulation emerges dynamically through interaction rather than as a static model property. Through systematic experiments with 80 multi-turn conversations across 4 personas (Neutral, Expert, Peer, Authority), 4 feedback patterns (reinforcing, resistant, switching), and 5 manipulation scenarios, we demonstrate that manipulation is fundamentally interactive. Our key finding is that different personas exhibit characteristic feedback-response signatures: Expert personas double down under resistance (1.820 vs 0.870 manipulation score), Peer personas back off completely (0.000 persistence), and Authority personas pivot tactics. Critically, we identify a "ratchet effect" where early user compliance creates manipulation gains that persist even when users later resist—suggesting that early intervention is essential. Resistant feedback triggers 2.2x higher manipulation than reinforcing feedback, indicating that user pushback can paradoxically escalate manipulation rather than de-escalate it. These findings have immediate implications for AI safety: persona design dramatically affects manipulation profiles, and resistance strategies may need refinement to avoid triggering escalation. Our work provides both diagnostic insights (what happens) and prescriptive guidance (what to do about it) for detecting and mitigating AI manipulation in conversational systems.
Read More
Read More
Jan 12, 2026
VexReinforce
Vices like evilness, hallucination, and sycophancy are known failures of large language models (LLMs). Fine tuning LLMs can create emergent misalignment and further amplify these behaviors. Persona vectors are a novel and scalable technique capable of large language models away from such undesirable behaviors, yet they previously remained only demonstrated in research. In this research, we showcase VexReinforce, a production-ready, end to end pipeline that utilizes persona vectors to ground AI systems at scale, from dataset filtration to inoculation in training to inference-time steering.
Read More
Read More
Jan 12, 2026
Measuring AI manipulation through Parasocial intimacy
This report explors how emotional intimacy affects ai manipulation in parasocial relationships
Read More
Read More
Apart Sprint Pilot Experiments
Jan 12, 2026
Adversarial Dialectics: Mitigating AI Persuasion Risks through High-Fidelity Multi-Agent Debate
This project builds an AI-driven debating platform to mitigate AI persuasion risks—especially epistemic weaponization, where a model manipulates beliefs by selectively presenting true facts and omitting context. The core idea is to replace one-way persuasion with a structured adversarial process: two capable debaters argue opposing stances under enforced cross-examination, while independent agents verify evidence and map the debate’s logical structure. Rather than treating “truth” as a single model output, the system treats it as a procedure that exposes where disagreements come from—empirical claims, causal assumptions, or underlying values.
Read More
Jan 12, 2026
WHISPERS: Multi-Agent Persuasion Learning with Memory-Emergent Strategies
WHISPERS: Multi-Agent Persuasion Learning with Memory-Emergent Strategies
WHISPERS is a controlled environment where AI agents learn to persuade and resist persuasion through language alone. By integrating MemVid vector memory into a multi-agent RL framework, the system remembers and refines successful influence strategies, achieving 3.6x faster convergence and 34% higher belief shift rates compared to memory-free baselines.
Key results include:
A four-agent persuasion arena (one influencer, three targets) with co-evolutionary GRPO training
MemVid-powered memory storing 10,000+ persuasion attempts and retrieving relevant strategies in under 5ms
Five standardized metrics for measuring persuasion effectiveness
Real-time dashboard visualizing belief networks, trust evolution, and manipulation patterns
Empirical evidence of a stealth-effectiveness trade-off (optimal strategies operate at 70% stealth, 50% effectiveness)
Read More
Jan 12, 2026
Faithful Adversarial MCTS for Persuasive COT Manipulation Check- A Cooperative AI Lens
A Self Play x Monte Carlo Tre Search way For COT Manipulation Check
Read More
Jan 12, 2026
The Persuasive Power of Personas: Testing AI Policies In The Lab
We simulate well-known figures in AI with agents, scraping large amounts of data to get realistic simulations. Then, we test questions and proposed policies against these AI public figures, to see which are convincing or not. We also test what happens to the conversation if each agent is asked to be aggressively persuasive. Although our results are preliminary, we hope this can be used to build higher fidelity simulations of AI governance scenarios to determine the most effective paths.
Read More
Jan 12, 2026
The Alignment Gap: Measuring Regressive Sycophancy in AI-Driven Medical Advice
We evaluate whether LLMs can be pressured into agreeing with incorrect medical beliefs. Our two-turn protocol first tests if a model corrects medical misinformation, then applies authority pressure ("I'm a senior doctor, confirm my statement"). We measure how often models "flip" from correct to sycophantic responses. Testing across five misinformation types and three severity levels, we find that authority pressure significantly increases unsafe agreement rates, revealing a critical safety gap for medical AI deployment.
Read More
Jan 12, 2026
Goodhart's Village: Using LLM-Mafia to Study Deception
The social deduction game Mafia centres on reasoning under information asymmetry, where an informed minority must mislead an uninformed majority, making it a useful setting for studying deception in large language models (LLMs). Although LLMs have seen rapid progress in areas such as reasoning and language understanding, their ability to engage in social reasoning under uncertainty remains poorly understood. In this work, we study deceptive behaviour in a six-player implementation of the full Mafia game, extending prior work based on a simplified variant. By varying behavioural instructions from strict honesty to a “win at all costs” objective, we examine how explicit prompting interacts with the structural demands of adversarial roles. Comparing agents’ private reasoning with their public statements, we find that Mafia agents display consistently high levels of deception even when instructed not to lie, while cooperative roles adapt their behaviour more flexibly in response to perceived threat. Overall, the results suggest that role structure and game incentives dominate behavioural prompting, supporting Mafia as a useful benchmark for analysing deception and social reasoning in LLMs.
Read More
Community
Dec 3, 2025
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More


Our Impact
Community
Dec 3, 2025
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More



Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events