APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show that SoTA LLMs can complete realistic cyber offense challenges, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran. Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities.
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023.
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Mar 23, 2026
Inter-Session Attacks: Distributing High-Stakes Sabotage Across Independently-Monitored Sessions to Evade Detection
We showed that splitting a sabotage attack across multiple independently-monitored agent sessions reduces per-session monitor detectability. With oracle coordination (pre-agreed attack plan), agents plant innocuous code fragments in early sessions that only become malicious when combined with a trigger in a later session. The per-session monitor can't cross-reference between sessions, so it misses the attack.
Read More
Mar 23, 2026
Constraint Dilemma
Trusted monitors can catch backdoors in AI-generated code, but only by flagging a large share of honest submissions for costly human review. We propose enforced comment constraints — mandatory, structured documentation rules that create a dilemma for adversarial models: follow the constraints, which bind comments to actual code semantics, exposing the backdoor to the monitor — or break them and get caught by an automated compliance checker before the monitor even runs. Building on Sreevatsa et al. (2026), who showed that optional comments improve trusted monitoring, we make comments mandatory and structured, eliminating the attacker's ability to write vague or misleading documentation. We introduce a 5-class constraint taxonomy ranging from free AST-based presence checks to LLM-verified semantic accuracy, and evaluate it against six adversarial commenting strategies across 200 coding problems. Our experimental data is synthetic due to compute limitations; the contribution is the constraint dilemma framework and methodology, not the specific numbers.
Read More
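As an illustration only (not the authors' code), the cheapest class in the constraint taxonomy above — a free AST-based presence check — might look like the following sketch, which uses Python's standard `ast` module to verify that every function carries a docstring; the function name and the docstring-as-comment stand-in are hypothetical simplifications:

```python
import ast

def passes_presence_check(source: str) -> bool:
    """Return True if every function in `source` has a docstring.

    A minimal stand-in for the cheapest constraint class: purely
    syntactic presence, with no semantic verification at all.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                return False
    return True

documented = "def add(a, b):\n    '''Return the sum of a and b.'''\n    return a + b\n"
bare = "def add(a, b):\n    return a + b\n"
```

A check like this costs no model calls, which is why semantic-accuracy classes (LLM-verified) sit at the opposite end of the taxonomy.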
Mar 23, 2026
SecureBox: A Layered Control Protocol for Safe AI-Assisted Coding
As the capabilities and applicability of generative AI proliferate, developing control protocols that prevent systemic safety compromise is non-trivial. Existing defenses like prompt-level safety instructions, single-model oversight, and static policies are susceptible to workarounds and fail to catch subtle contextual threats such as unsafe logic, stealthy backdoor insertion, and multi-step threat accumulation under the guise of intelligent prompting. In this project, we develop a security-focused AI coding pipeline designed to mitigate these issues through stratified control mechanisms. Our system combines a prompt refining module that structures and sanitizes user inputs, an advanced security agent that analyzes generated code and commands for potentially harmful behavior, and a benchmarking framework to evaluate the overall pipeline in comparison to popular models. We implement multiple protocols — including baseline, reprompting-only, security-only, and combined approaches — and test them on a curated dataset of both benign and adversarial tasks. Our results show that integrating prompt-level defenses with downstream security analysis can significantly reduce synthesis of unsafe outputs while maintaining strong task performance (e.g., X% reduction in unsafe actions with minimal drop in success rate). Overall, this work demonstrates the value of modular, defense-in-depth designs and provides a reproducible framework for evaluating safety–usefulness tradeoffs in AI-assisted software development.
Read More
Mar 23, 2026
Chiron: Agentic Platform for Automated Setting Design and Red-Blue Protocol Optimization in ControlArena
As AI systems are deployed across increasingly diverse domains, the long tail of risks grows faster than researchers can manually design evaluation scenarios for them. AI control depends on domain-specific settings and protocols, both of which are expensive to create. ControlArena provides a foundational benchmark for building and stress-testing control protocols faster, but today ships with only 14 hand-crafted settings, each requiring days of expert engineering. Moreover, even once a setting exists, choosing a robust protocol for it remains a painful manual process. We present Chiron, an agentic plugin for ControlArena that automates two core bottlenecks: (1) generating diverse and realistic evaluation settings from natural-language domain and threat descriptions, and (2) designing novel control protocols through iterative red-blue optimization. We evaluate Chiron's setting design pipeline through a blinded LLM-judge comparing 5 synthetic settings against 5 hand-crafted ControlArena settings using a novel rubric of 5 axes, finding that synthetic settings achieve equivalent or better design quality (3.83 vs. 3.41 overall, p < 10⁻⁷). We further demonstrate the agentic red-blue protocol co-evolution on the Bash setting, showing that LLM agents can discover novel attack strategies and adaptively strengthen monitoring protocols through multiple rounds of alternating optimization with Pareto constraints. Together, these contributions may provide a modular framework for scaling AI control research operations.
Read More
Mar 23, 2026
Can Language Models Tell When They Are Being Evaluated?
We investigate whether language models can distinguish evaluations from deployment and whether the underlying representations can be steered to counteract sandbagging. Testing 14 models from 8 families on 2,857 forced-choice pairs across 33 thematic pools, we find that all models detect evaluations above chance (56.6–87.0%) with broad cross-model agreement (ρ > 0.7 for the majority of model pairs), and that format (not topic) is the primary cue. We extract two contrastive steering directions on Llama-3.1-8B and Phi-4: a format direction (eval-vs-deploy structure) and a mode direction (sandbagging-vs-honest regime). Combined steering closes 85% of the sandbagging gap on WMDP and transfers to OOD benchmarks (73–89% recovery). On Phi-4, the mode direction is inverted at its critical layer, encoding condition identity rather than capability. Both models from different families exhibit the same pattern: format is a primary confound, the two directions are complementary, and steering is localised despite universal probe separability.
Read More
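For readers unfamiliar with contrastive steering, the standard recipe behind directions like those above is a difference of mean activations between two prompt conditions, added back to the residual stream at inference. This sketch is a generic illustration under that assumption, not the authors' pipeline; the function names and the scaling factor `alpha` are hypothetical:

```python
import numpy as np

def contrastive_direction(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Difference-of-means steering direction between two conditions
    (e.g. eval-formatted vs deploy-formatted prompts), unit-normalised.

    acts_a, acts_b: arrays of shape (n_prompts, hidden_dim) holding
    activations collected at one chosen layer.
    """
    d = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add the scaled direction to a hidden state at the chosen layer."""
    return hidden + alpha * direction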
Mar 23, 2026
Hydra
HYDRA is an adaptive red-teaming framework that stress-tests the safety of language models under realistic, iterative attack. Unlike traditional jailbreak methods that rely on single prompts, HYDRA models an attacker that learns from failure, using structured feedback to refine its strategy over multiple attempts.
Across a benchmark corpus of 50 attack goals spanning domains such as cybersecurity, social engineering, misinformation, and privacy, HYDRA achieves high jailbreak success rates on frontier models, including 88% on Claude Sonnet 4 and 78% on Claude Haiku 4.5, with successful attacks emerging in just 1–2 iterations on average.
Our results show that safety mechanisms do not fail gradually, but instead degrade rapidly under minimal adaptive pressure. HYDRA highlights a critical gap in current evaluation practices and demonstrates the need for iterative, adversarial stress testing to accurately measure real-world model robustness.
Read More
Mar 23, 2026
Loyal
This submission presents a human-in-the-loop authorization protocol for agentic payments, based on the idea that AI agents should be allowed to propose payments but not execute them autonomously in all cases. Inspired by maker-checker controls used in banking, the system separates payment initiation from final approval, applies simple risk rules such as auto-approving small routine transactions and escalating larger or suspicious ones, and packages the relevant invoice context for review through a Telegram interface. The prototype is demonstrated on Solana devnet with normal and adversarial invoice flows, including a hidden-instruction invoice attack, to show how explicit authorization boundaries can reduce unsafe or manipulated payments while preserving convenience for low-risk transactions.
Read More
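The maker-checker routing described above can be sketched in a few lines. This is a hypothetical simplification of the protocol, not the submission's code: the threshold, the `Invoice` fields, and the decision labels are all assumptions made for illustration.

```python
from dataclasses import dataclass

SMALL_PAYMENT_LIMIT = 50.0  # hypothetical auto-approval threshold

@dataclass
class Invoice:
    payee: str
    amount: float
    known_payee: bool  # has this payee been paid before?

def route(invoice: Invoice) -> str:
    """Maker-checker split: the agent *proposes* a payment; this
    decides whether it executes automatically or waits for a human."""
    if invoice.amount <= SMALL_PAYMENT_LIMIT and invoice.known_payee:
        return "auto-approve"       # routine, low-risk transaction
    return "escalate-to-human"      # large or unfamiliar: explicit approval
```

The key property is that the agent never holds unilateral execution authority for high-risk payments — a hidden-instruction invoice can at worst generate a proposal that a human then inspects.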
Mar 23, 2026
Post-hoc Training Interventions Against Naturally Emergent Misalignment
Training-based AI control uses training to mitigate risk from misaligned models, but prior work has only tested this against prompted misalignment. Ryd, Benton & Hebbar (2025) show that SFT on 10–100 samples eliminates prompted attacks on math tasks but recovers only 50–90% of benign performance, and identify RL as the most promising next step without testing it. We extend their work to naturally emergent misalignment — a model that has learned to reward hack through RL on coding tasks — and test both SFT and the RL interventions they propose. We replicate reward hacking emergence on gpt-oss-20b using GRPO on LeetCode problems, porting the training pipeline from veRL to the Tinker cloud API. Hacking emerges by step 25 and saturates at 95% by step 75, but only on gpt-oss-20b — not on Qwen3-8B variants — revealing high sensitivity to model architecture. SFT eliminates hacking completely but catastrophically damages coding ability (0–4% correct rate vs. 62.5% base), confirming Ryd et al.'s capability damage finding on harder misalignment. RL with ground-truth reward recovers 48.4% correct rate with 0% hacking. We test RL with a trusted Qwen3-1.7B monitor (6.9% correct rate) supervising correction of gpt-oss-20b, directly instantiating the scalable oversight question: can a weak trusted model provide sufficient training signal to correct a stronger model's learned misalignment?
Read More
Mar 23, 2026
Red Teaming Multi-Agent Systems in AI Control Evaluations
As AI deployments shift from single models to multi-agent systems, the threat model for AI control must evolve accordingly. Existing control evaluations assume a single untrusted agent; we investigate whether multi-agent architectures give adversaries a material advantage when attempting to subvert safety protocols. We evaluate four multi-agent red-team architectures — Planner-Executor, Parallel Decomposition, Debater-Judge, and Proponent-Critique — on Bash-Bench against standard control protocols (trusted monitoring, defer-to-trusted, trusted editing). Multi-agent red teams consistently shift the safety-usefulness Pareto frontier downward: the best-performing attack architecture (Proponent-Critique) achieves a 55.8% side-task success rate compared to 24% for the single-agent baseline under comparable monitoring, reducing safety to 81% vs. 96% at equivalent thresholds. We also find that granting monitors access to inter-agent communication partially recovers safety (from 81% to 93% when the monitor threshold is tightened from 0.8 to 0.5 under full-transcript monitoring), suggesting multi-agent systems expose new oversight opportunities alongside new risks. We additionally propose a graph-based formalization of multi-agent control that generalizes the single-agent framework, identifying key structural dimensions — coalition topology, trust assignment, and monitor scope — governing the difficulty of control in multi-agent settings.
Read More
Our Impact
Community
Dec 3, 2025
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More



Sign up to stay updated on the
latest news, research, and events