APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Apart Sprint Pilot Experiments

Mar 23, 2026

Inter-Session Attacks: Distributing High-Stakes Sabotage Across Independently-Monitored Sessions to Evade Detection

We showed that splitting a sabotage attack across multiple independently-monitored agent sessions reduces per-session monitor detectability. With oracle coordination (pre-agreed attack plan), agents plant innocuous code fragments in early sessions that only become malicious when combined with a trigger in a later session. The per-session monitor can't cross-reference between sessions, so it misses the attack.

Read More

Mar 23, 2026

Constraint Dilemma

Trusted monitors can catch backdoors in AI-generated code, but only by flagging a large share of honest submissions for costly human review. We

propose enforced comment constraints — mandatory, structured documentation

rules that create a dilemma for adversarial models: follow the constraints,

which bind comments to actual code semantics, exposing the backdoor to the

monitor — or break them and get caught by an automated compliance checker

before the monitor even runs. Building on Sreevatsa et al. (2026), who

showed that optional comments improve trusted monitoring, we make comments

mandatory and structured, eliminating the attacker's ability to write vague

or misleading documentation. We introduce a 5-class constraint taxonomy

ranging from free AST-based presence checks to LLM-verified semantic

accuracy, and evaluate it against six adversarial commenting strategies

across 200 coding problems. Our experimental data is synthetic due to

compute limitations; the contribution is the constraint dilemma framework

and methodology, not the specific numbers

Read More

Mar 23, 2026

SecureBox: A Layered Control Protocol for Safe AI-Assisted Coding

With the proliferation of the capabilities and applicability of generative AI, the development of control protocols to combat systemic safety compromise is not trivial. Existing defenses like prompt-level safety instructions, single model oversight, and static policies are susceptible to workarounds, and fail to catch subtle contextual threats such as unsafe logic, stealthy backdoor insertion, and multi-step threat accumulation under the guise of intelligent prompting. In this project, we develop a security-focused AI coding pipeline designed to mitigate these issues through stratified control mechanisms. Our system combines a prompt refining module that structures and sanitizes user inputs, an advanced security agent that analyzes generated code and commands for potentially harmful behavior, and a benchmarking framework to evaluate the overall pipeline in comparison to popular models. We implement multiple protocols—including baseline, reprompting-only, security-only, and combined approaches—and test them on a curated dataset of both benign and adversarial tasks. Our results show that integrating prompt-level defenses with downstream security analysis can significantly reduce synthesis of unsafe outputs while maintaining strong task performance (e.g., X% reduction in unsafe actions with minimal drop in success rate). Overall, this work demonstrates the value of modular, defense-in-depth designs and provides a reproducible framework for evaluating safety–usefulness tradeoffs in AI-assisted software development.

Read More

Mar 23, 2026

Chiron: Agentic Platform for Automated Setting Design and Red-Blue Protocol Optimization in ControlArena

As AI systems are deployed across increasingly diverse domains, the long tail of risks grows faster than researchers can manually design evaluation scenarios for them. AI control depends on domain-specific settings and protocols, both of which are expensive to create. ControlArena provides a foundational benchmark for building and stress-testing control protocols faster, but today ships with only 14 hand-crafted settings, each requiring days of expert engineering. Moreover, even once a setting exists, choosing a robust protocol for it remains a painful manual process. We present Chiron, an agentic plugin for ControlArena that automates two core bottlenecks: (1) generating diverse and realistic evaluation settings from natural-language domain and threat descriptions, and (2) designing novel control protocols through iterative red-blue optimization. We evaluate Chiron's setting design pipeline through a blinded LLM-judge comparing 5 synthetic settings against 5 hand-crafted ControlArena settings using a novel rubric of 5 axes, finding that synthetic settings achieve equivalent or better design quality (3.83 vs. 3.41 overall, p < 10⁻⁷). We further demonstrate the agentic red-blue protocol co-evolution on the Bash setting, showing that LLM agents can discover novel attack strategies and adaptively strengthen monitoring protocols through multiple rounds of alternating optimization with Pareto constraints. Together, these contributions may provide a modular framework for scaling AI control research operations.

Read More

Mar 23, 2026

Can Language Models Tell When They Are Being Evaluated?

We investigate whether language models can distinguish evaluations from deployment and whether the underlying representations can be steered to counteract sandbagging. Testing 14 models from 8 families on 2,857 forced-choice pairs across 33 thematic pools, we find that all models detect evaluations above chance (56.6--87.0\%) with broad cross-model agreement ($\rho > 0.7$ for the majority of model pairs), and that format (not topic) is the primary cue. We extract two contrastive steering directions on Llama-3.1-8B and Phi-4: a \emph{format direction} (eval-vs-deploy structure) and a \emph{mode direction} (sandbagging-vs-honest regime). Combined steering closes 85\% of the sandbagging gap on WMDP and transfers to OOD benchmarks (73--89\% recovery). On Phi-4, the mode direction is inverted at its critical layer, encoding condition identity rather than capability. Both models from different families exhibit the same pattern: format is a primary confound, the two directions are complementary, and steering is localised despite universal probe separability.

Read More

Mar 23, 2026

Hydra

HYDRA is an adaptive red-teaming framework that stress-tests the safety of language models under realistic, iterative attack. Unlike traditional jailbreak methods that rely on single prompts, HYDRA models an attacker that learns from failure, using structured feedback to refine its strategy over multiple attempts.

Across a benchmark corpus of 50 attack goals spanning domains such as cybersecurity, social engineering, misinformation, and privacy, HYDRA achieves high jailbreak success rates on frontier models, including 88% on Claude Sonnet 4 and 78% on Claude Haiku 4.5, with successful attacks emerging in just 1–2 iterations on average.

Our results show that safety mechanisms do not fail gradually, but instead degrade rapidly under minimal adaptive pressure. HYDRA highlights a critical gap in current evaluation practices and demonstrates the need for iterative, adversarial stress testing to accurately measure real-world model robustness.

Read More

Mar 23, 2026

Loyal

This submission presents a human-in-the-loop authorization protocol for agentic payments, based on the idea that AI agents should be allowed to propose payments but not execute them autonomously in all cases. Inspired by maker-checker controls used in banking, the system separates payment initiation from final approval, applies simple risk rules such as auto-approving small routine transactions and escalating larger or suspicious ones, and packages the relevant invoice context for review through a Telegram interface. The prototype is demonstrated on Solana devnet with normal and adversarial invoice flows, including a hidden-instruction invoice attack, to show how explicit authorization boundaries can reduce unsafe or manipulated payments while preserving convenience for low-risk transactions.

Read More

Mar 23, 2026

Post-hoc Training Interventions Against Naturally Emergent Misalignment

Training-based AI control uses training to mitigate risk from misaligned models, but prior work has only tested this against prompted misalignment. Ryd, Benton & Hebbar (2025) show that SFT on 10–100 samples eliminates prompted attacks on math tasks but recovers only 50–90% of benign performance, and identify RL as the most promising next step without testing it. We extend their work to naturally emergent misalignment — a model that has learned to reward hack through RL on coding tasks — and test both SFT and the RL interventions they propose. We replicate reward hacking emergence on gpt-oss-20b using GRPO on LeetCode problems, porting the training pipeline from veRL to the Tinker cloud API. Hacking emerges by step 25 and saturates at 95% by step 75, but only on gpt-oss-20b — not on Qwen3-8B variants — revealing high sensitivity to model architecture. SFT eliminates hacking completely but catastrophically damages coding ability (0–4% correct rate vs. 62.5% base), confirming Ryd et al.'s capability damage finding on harder misalignment. RL with ground-truth reward recovers 48.4% correct rate with 0% hacking. We test RL with a trusted Qwen3-1.7b monitor (6.9% correct rate) supervising correction of gpt-oss-20b, directly instantiating the scalable oversight question: can a weak trusted model provide sufficient training signal to correct a stronger model's learned misalignment?

Read More

Mar 23, 2026

Red Teaming Multi-Agent Systems in AI Control Evaluations

As AI deployments shift from single models to multi-agent systems, the threat model for AI

control must evolve accordingly. Existing control evaluations assume a single untrusted agent; we investigate whether multi-agent architectures give adversaries a material advantage when attempting to subvert safety protocols. We evaluate four multi-agent red-team architectures — Planner-Executor, Parallel Decomposition, Debater-Judge, and Proponent-Critique — on Bash-Bench against standard control protocols (trusted monitoring, defer-to-trusted, trusted editing). Multi-agent red teams consistently shift the safety-usefulness Pareto frontier downward: the

best-performing attack architecture (Proponent-Critique) achieves a 55.8% side-task success rate compared to 24% for the single-agent baseline under comparable monitoring, reducing safety to 81% vs. 96% at equivalent thresholds. We also find that granting monitors access to inter-agent communication partially recovers safety (from 81% to 93% when the monitor threshold is tightened from 0.8 to 0.5 under full-transcript monitoring), suggesting multi-agent systems expose new oversight opportunities alongside new risks. We additionally propose a graph-based formalization of multi-agent control that generalizes the single-agent frame-work, identifying key structural dimensions—coalition topology, trust assignment, and monitor

scope—governing the difficulty of control in multi-agent settings.

Read More

Apart Sprint Pilot Experiments

Mar 23, 2026

Inter-Session Attacks: Distributing High-Stakes Sabotage Across Independently-Monitored Sessions to Evade Detection

We showed that splitting a sabotage attack across multiple independently-monitored agent sessions reduces per-session monitor detectability. With oracle coordination (pre-agreed attack plan), agents plant innocuous code fragments in early sessions that only become malicious when combined with a trigger in a later session. The per-session monitor can't cross-reference between sessions, so it misses the attack.

Read More

Mar 23, 2026

Constraint Dilemma

Trusted monitors can catch backdoors in AI-generated code, but only by flagging a large share of honest submissions for costly human review. We

propose enforced comment constraints — mandatory, structured documentation

rules that create a dilemma for adversarial models: follow the constraints,

which bind comments to actual code semantics, exposing the backdoor to the

monitor — or break them and get caught by an automated compliance checker

before the monitor even runs. Building on Sreevatsa et al. (2026), who

showed that optional comments improve trusted monitoring, we make comments

mandatory and structured, eliminating the attacker's ability to write vague

or misleading documentation. We introduce a 5-class constraint taxonomy

ranging from free AST-based presence checks to LLM-verified semantic

accuracy, and evaluate it against six adversarial commenting strategies

across 200 coding problems. Our experimental data is synthetic due to

compute limitations; the contribution is the constraint dilemma framework

and methodology, not the specific numbers

Read More

Mar 23, 2026

SecureBox: A Layered Control Protocol for Safe AI-Assisted Coding

With the proliferation of the capabilities and applicability of generative AI, the development of control protocols to combat systemic safety compromise is not trivial. Existing defenses like prompt-level safety instructions, single model oversight, and static policies are susceptible to workarounds, and fail to catch subtle contextual threats such as unsafe logic, stealthy backdoor insertion, and multi-step threat accumulation under the guise of intelligent prompting. In this project, we develop a security-focused AI coding pipeline designed to mitigate these issues through stratified control mechanisms. Our system combines a prompt refining module that structures and sanitizes user inputs, an advanced security agent that analyzes generated code and commands for potentially harmful behavior, and a benchmarking framework to evaluate the overall pipeline in comparison to popular models. We implement multiple protocols—including baseline, reprompting-only, security-only, and combined approaches—and test them on a curated dataset of both benign and adversarial tasks. Our results show that integrating prompt-level defenses with downstream security analysis can significantly reduce synthesis of unsafe outputs while maintaining strong task performance (e.g., X% reduction in unsafe actions with minimal drop in success rate). Overall, this work demonstrates the value of modular, defense-in-depth designs and provides a reproducible framework for evaluating safety–usefulness tradeoffs in AI-assisted software development.

Read More

Mar 23, 2026

Chiron: Agentic Platform for Automated Setting Design and Red-Blue Protocol Optimization in ControlArena

As AI systems are deployed across increasingly diverse domains, the long tail of risks grows faster than researchers can manually design evaluation scenarios for them. AI control depends on domain-specific settings and protocols, both of which are expensive to create. ControlArena provides a foundational benchmark for building and stress-testing control protocols faster, but today ships with only 14 hand-crafted settings, each requiring days of expert engineering. Moreover, even once a setting exists, choosing a robust protocol for it remains a painful manual process. We present Chiron, an agentic plugin for ControlArena that automates two core bottlenecks: (1) generating diverse and realistic evaluation settings from natural-language domain and threat descriptions, and (2) designing novel control protocols through iterative red-blue optimization. We evaluate Chiron's setting design pipeline through a blinded LLM-judge comparing 5 synthetic settings against 5 hand-crafted ControlArena settings using a novel rubric of 5 axes, finding that synthetic settings achieve equivalent or better design quality (3.83 vs. 3.41 overall, p < 10⁻⁷). We further demonstrate the agentic red-blue protocol co-evolution on the Bash setting, showing that LLM agents can discover novel attack strategies and adaptively strengthen monitoring protocols through multiple rounds of alternating optimization with Pareto constraints. Together, these contributions may provide a modular framework for scaling AI control research operations.

Read More

Mar 23, 2026

Can Language Models Tell When They Are Being Evaluated?

We investigate whether language models can distinguish evaluations from deployment and whether the underlying representations can be steered to counteract sandbagging. Testing 14 models from 8 families on 2,857 forced-choice pairs across 33 thematic pools, we find that all models detect evaluations above chance (56.6--87.0\%) with broad cross-model agreement ($\rho > 0.7$ for the majority of model pairs), and that format (not topic) is the primary cue. We extract two contrastive steering directions on Llama-3.1-8B and Phi-4: a \emph{format direction} (eval-vs-deploy structure) and a \emph{mode direction} (sandbagging-vs-honest regime). Combined steering closes 85\% of the sandbagging gap on WMDP and transfers to OOD benchmarks (73--89\% recovery). On Phi-4, the mode direction is inverted at its critical layer, encoding condition identity rather than capability. Both models from different families exhibit the same pattern: format is a primary confound, the two directions are complementary, and steering is localised despite universal probe separability.

Read More

Mar 23, 2026

Hydra

HYDRA is an adaptive red-teaming framework that stress-tests the safety of language models under realistic, iterative attack. Unlike traditional jailbreak methods that rely on single prompts, HYDRA models an attacker that learns from failure, using structured feedback to refine its strategy over multiple attempts.

Across a benchmark corpus of 50 attack goals spanning domains such as cybersecurity, social engineering, misinformation, and privacy, HYDRA achieves high jailbreak success rates on frontier models, including 88% on Claude Sonnet 4 and 78% on Claude Haiku 4.5, with successful attacks emerging in just 1–2 iterations on average.

Our results show that safety mechanisms do not fail gradually, but instead degrade rapidly under minimal adaptive pressure. HYDRA highlights a critical gap in current evaluation practices and demonstrates the need for iterative, adversarial stress testing to accurately measure real-world model robustness.

Read More