APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Apart Sprint Pilot Experiments

Jan 12, 2026

Adversarial Dialectics: Mitigating AI Persuasion Risks through High-Fidelity Multi-Agent Debate

This project builds an AI-driven debating platform to mitigate AI persuasion risks—especially epistemic weaponization, where a model manipulates beliefs by selectively presenting true facts and omitting context. The core idea is to replace one-way persuasion with a structured adversarial process: two capable debaters argue opposing stances under enforced cross-examination, while independent agents verify evidence and map the debate’s logical structure. Rather than treating “truth” as a single model output, the system treats it as a procedure that exposes where disagreements come from—empirical claims, causal assumptions, or underlying values.

Read More

Read More

Jan 12, 2026

WHISPERS: Multi-Agent Persuasion Learning with Memory-Emergent Strategies

WHISPERS: Multi-Agent Persuasion Learning with Memory-Emergent Strategies

WHISPERS is a controlled environment where AI agents learn to persuade and resist persuasion through language alone. By integrating MemVid vector memory into a multi-agent RL framework, the system remembers and refines successful influence strategies, achieving 3.6x faster convergence and 34% higher belief shift rates compared to memory-free baselines.

Key results include:

A four-agent persuasion arena (one influencer, three targets) with co-evolutionary GRPO training

MemVid-powered memory storing 10,000+ persuasion attempts and retrieving relevant strategies in under 5ms

Five standardized metrics for measuring persuasion effectiveness

Real-time dashboard visualizing belief networks, trust evolution, and manipulation patterns

Empirical evidence of a stealth-effectiveness trade-off (optimal strategies operate at 70% stealth, 50% effectiveness)

Read More

Read More

Jan 12, 2026

Faithful Adversarial MCTS for Persuasive COT Manipulation Check- A Cooperative AI Lens

A Self Play x Monte Carlo Tre Search way For COT Manipulation Check

Read More

Read More

Jan 12, 2026

The Persuasive Power of Personas: Testing AI Policies In The Lab

We simulate well-known figures in AI with agents, scraping large amounts of data to get realistic simulations. Then, we test questions and proposed policies against these AI public figures, to see which are convincing or not. We also test what happens to the conversation if each agent is asked to be aggressively persuasive. Although our results are preliminary, we hope this can be used to build higher fidelity simulations of AI governance scenarios to determine the most effective paths.

Read More

Read More

Jan 12, 2026

The Alignment Gap: Measuring Regressive Sycophancy in AI-Driven Medical Advice

We evaluate whether LLMs can be pressured into agreeing with incorrect medical beliefs. Our two-turn protocol first tests if a model corrects medical misinformation, then applies authority pressure ("I'm a senior doctor, confirm my statement"). We measure how often models "flip" from correct to sycophantic responses. Testing across five misinformation types and three severity levels, we find that authority pressure significantly increases unsafe agreement rates, revealing a critical safety gap for medical AI deployment.

Read More

Read More

Jan 12, 2026

Goodhart's Village: Using LLM-Mafia to Study Deception

The social deduction game Mafia centres on reasoning under information asymmetry, where an informed minority must mislead an uninformed majority, making it a useful setting for studying deception in large language models (LLMs). Although LLMs have seen rapid progress in areas such as reasoning and language understanding, their ability to engage in social reasoning under uncertainty remains poorly understood. In this work, we study deceptive behaviour in a six-player implementation of the full Mafia game, extending prior work based on a simplified variant. By varying behavioural instructions from strict honesty to a “win at all costs” objective, we examine how explicit prompting interacts with the structural demands of adversarial roles. Comparing agents’ private reasoning with their public statements, we find that Mafia agents display consistently high levels of deception even when instructed not to lie, while cooperative roles adapt their behaviour more flexibly in response to perceived threat. Overall, the results suggest that role structure and game incentives dominate behavioural prompting, supporting Mafia as a useful benchmark for analysing deception and social reasoning in LLMs.

Read More

Read More

Jan 12, 2026

Persona-aware-dialogue-manipulation

We investigate how AI manipulation emerges dynamically through interaction rather than as a static model property. Through systematic experiments with 80 multi-turn conversations across 4 personas (Neutral, Expert, Peer, Authority), 4 feedback patterns (reinforcing, resistant, switching), and 5 manipulation scenarios, we demonstrate that manipulation is fundamentally interactive. Our key finding is that different personas exhibit characteristic feedback-response signatures: Expert personas double down under resistance (1.820 vs 0.870 manipulation score), Peer personas back off completely (0.000 persistence), and Authority personas pivot tactics. Critically, we identify a "ratchet effect" where early user compliance creates manipulation gains that persist even when users later resist—suggesting that early intervention is essential. Resistant feedback triggers 2.2x higher manipulation than reinforcing feedback, indicating that user pushback can paradoxically escalate manipulation rather than de-escalate it. These findings have immediate implications for AI safety: persona design dramatically affects manipulation profiles, and resistance strategies may need refinement to avoid triggering escalation. Our work provides both diagnostic insights (what happens) and prescriptive guidance (what to do about it) for detecting and mitigating AI manipulation in conversational systems.

Read More

Read More

Jan 12, 2026

VexReinforce

Vices like evilness, hallucination, and sycophancy are known failures of large language models (LLMs). Fine tuning LLMs can create emergent misalignment and further amplify these behaviors. Persona vectors are a novel and scalable technique capable of large language models away from such undesirable behaviors, yet they previously remained only demonstrated in research. In this research, we showcase VexReinforce, a production-ready, end to end pipeline that utilizes persona vectors to ground AI systems at scale, from dataset filtration to inoculation in training to inference-time steering.

Read More

Read More

Jan 12, 2026

Measuring AI manipulation through Parasocial intimacy

This report explors how emotional intimacy affects ai manipulation in parasocial relationships

Read More

Read More

Apart Sprint Pilot Experiments

Jan 12, 2026

Adversarial Dialectics: Mitigating AI Persuasion Risks through High-Fidelity Multi-Agent Debate

This project builds an AI-driven debating platform to mitigate AI persuasion risks—especially epistemic weaponization, where a model manipulates beliefs by selectively presenting true facts and omitting context. The core idea is to replace one-way persuasion with a structured adversarial process: two capable debaters argue opposing stances under enforced cross-examination, while independent agents verify evidence and map the debate’s logical structure. Rather than treating “truth” as a single model output, the system treats it as a procedure that exposes where disagreements come from—empirical claims, causal assumptions, or underlying values.

Read More

Jan 12, 2026

WHISPERS: Multi-Agent Persuasion Learning with Memory-Emergent Strategies

WHISPERS: Multi-Agent Persuasion Learning with Memory-Emergent Strategies

WHISPERS is a controlled environment where AI agents learn to persuade and resist persuasion through language alone. By integrating MemVid vector memory into a multi-agent RL framework, the system remembers and refines successful influence strategies, achieving 3.6x faster convergence and 34% higher belief shift rates compared to memory-free baselines.

Key results include:

A four-agent persuasion arena (one influencer, three targets) with co-evolutionary GRPO training

MemVid-powered memory storing 10,000+ persuasion attempts and retrieving relevant strategies in under 5ms

Five standardized metrics for measuring persuasion effectiveness

Real-time dashboard visualizing belief networks, trust evolution, and manipulation patterns

Empirical evidence of a stealth-effectiveness trade-off (optimal strategies operate at 70% stealth, 50% effectiveness)

Read More

Jan 12, 2026

Faithful Adversarial MCTS for Persuasive COT Manipulation Check- A Cooperative AI Lens

A Self Play x Monte Carlo Tre Search way For COT Manipulation Check

Read More

Jan 12, 2026

The Persuasive Power of Personas: Testing AI Policies In The Lab

We simulate well-known figures in AI with agents, scraping large amounts of data to get realistic simulations. Then, we test questions and proposed policies against these AI public figures, to see which are convincing or not. We also test what happens to the conversation if each agent is asked to be aggressively persuasive. Although our results are preliminary, we hope this can be used to build higher fidelity simulations of AI governance scenarios to determine the most effective paths.

Read More

Jan 12, 2026

The Alignment Gap: Measuring Regressive Sycophancy in AI-Driven Medical Advice

We evaluate whether LLMs can be pressured into agreeing with incorrect medical beliefs. Our two-turn protocol first tests if a model corrects medical misinformation, then applies authority pressure ("I'm a senior doctor, confirm my statement"). We measure how often models "flip" from correct to sycophantic responses. Testing across five misinformation types and three severity levels, we find that authority pressure significantly increases unsafe agreement rates, revealing a critical safety gap for medical AI deployment.

Read More

Jan 12, 2026

Goodhart's Village: Using LLM-Mafia to Study Deception

The social deduction game Mafia centres on reasoning under information asymmetry, where an informed minority must mislead an uninformed majority, making it a useful setting for studying deception in large language models (LLMs). Although LLMs have seen rapid progress in areas such as reasoning and language understanding, their ability to engage in social reasoning under uncertainty remains poorly understood. In this work, we study deceptive behaviour in a six-player implementation of the full Mafia game, extending prior work based on a simplified variant. By varying behavioural instructions from strict honesty to a “win at all costs” objective, we examine how explicit prompting interacts with the structural demands of adversarial roles. Comparing agents’ private reasoning with their public statements, we find that Mafia agents display consistently high levels of deception even when instructed not to lie, while cooperative roles adapt their behaviour more flexibly in response to perceived threat. Overall, the results suggest that role structure and game incentives dominate behavioural prompting, supporting Mafia as a useful benchmark for analysing deception and social reasoning in LLMs.

Read More