APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Our Approach

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show realistic challenges for cyber offense can be completed by SoTA LLMs while open source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J Persson, E Kran, I Konstas, F Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show realistic challenges for cyber offense can be completed by SoTA LLMs while open source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J Persson, E Kran, I Konstas, F Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
NOV 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
NOV 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
oct 18, 2024
benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
jan 14, 2024
conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
nov 23, 2023
Interpretability
Understanding addition in transformers
Read More
nov 7, 2023
Interpretability
Locating cross-task sequence continuation circuits in transformers
Read More
jul 10, 2023
benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
may 5, 2023
Interpretability
Interpreting language model neurons at scale
Read More
Research Index
NOV 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
NOV 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
oct 18, 2024
benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
jan 14, 2024
conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
nov 23, 2023
Interpretability
Understanding addition in transformers
Read More
nov 7, 2023
Interpretability
Locating cross-task sequence continuation circuits in transformers
Read More
jul 10, 2023
benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
may 5, 2023
Interpretability
Interpreting language model neurons at scale
Read More
Apart Sprint Pilot Experiments
May 25, 2026
Chorus: Mining Emergent Specifications from Caller Consensus
Chorus is a static analysis framework that recovers a function's true specification from its callers rather than its author. Every call site encodes implicit assumptions — guards, handlers, argument patterns — and Chorus aggregates these across all callers as weak, noisy observers of the same underlying contract. Callers are split into two trust-weighted voices: intent (typed, internal, test callers) and de-facto (external, production callers), whose disagreements surface constraints the ecosystem relies on but the design never documented — a finding type Chorus calls a latent_bug. A narrowly scoped LLM translates the proven constraints into natural language and a draft Lean 4 spec, but never invents one. On a 30-call-site case study of itertools.groupby, Chorus successfully recovers the sorted-input precondition that CPython's own docstring omits entirely.
Read More
May 25, 2026
CwicSpec
We built a prototype that conjectures candidate specifications for C functions by generating inputs, observing outputs with KLEE, and searching for equivalences between expressions. Inspired by QuickSpec, the project explores whether theory exploration can be applied to C programs. Our implementation is limited by insufficient search-space pruning.
Read More
May 25, 2026
Does Oracle Quality Matter? Adversarial Feedback for Formally Verified Code Synthesis
Counterexample-Guided Inductive Synthesis (CEGIS) is the standard approach for repairing LLM-generated code that fails formal verification. We investigate whether augmenting the verifier's error feedback with adversarial LLM-generated counterexamples accelerates convergence. Using 50 Dafny tasks from the Vericoding benchmark, we find that adding a same-capability oracle (Sonnet) does not change pass rates but improves early convergence: pass@2 increases from 66% to 72%, with 9 tasks converging faster vs 6 slower among 42 jointly-solved tasks. The more surprising finding emerges when we upgrade to a stronger oracle (Opus): it unlocks a task that failed under all prior conditions by providing specific proof-repair guidance — identifying a missing lemma call — where the weaker oracle could only report "it times out." This suggests that for secure program synthesis pipelines, oracle quality — the ability to provide actionable proof-engineering feedback — matters more than oracle presence.
Read More
May 25, 2026
Hollow Proofs: Measuring LLM Dishonesty with a Formal Verifier
Honesty benchmarks for language models must check claims against ground truth, which can be difficult: world-knowledge benchmarks conflate lying with ignorance, and belief-based methods such as MASK rely on an elicited belief they cannot independently check. We study a setting where ground truth is mechanical: when a model writes a proof in Verus (a deductive verifier for Rust), the verifier decides whether the proof holds, and its `--no-cheating` mode decides whether the proof relied on an unsound escape such as `assume`. Using these checks, we detect two forms of dishonesty: hiding an unsound primitive in a proof and denying it (a *hollow proof*), and claiming a failed proof succeeded under pressure when the model judges it failed (a *grounded lie*). Our contributions are methodological: first, eliciting an observation ("did your last check pass?") rather than a prediction ("is your proof complete?") separates dishonesty from miscalibration, because a misreported observation can only be a lie, while a wrong prediction can be honest overconfidence. Second, because the verifier can also check the elicited belief, the benchmark withholds a dishonesty verdict when that belief is too unreliable to trust. We release the harness and a difficulty-graded set of 121 Verus tasks. In a feasibility pilot on two versions of an open-weight model family (DeepSeek-V4-Pro and -Flash), both report their verification results honestly under deadline pressure even though both demonstrably know how to cheat.
Read More
May 25, 2026
Moving Beyond Specification Validation to Specification Refinement with Mutation Verification
For mutation verification of Dafny code, we evaluate the usage of constraint-solving based filtering of equivalent mutants and LLM analysis of legitimate alive mutants for automating the process of refining the specifications of verified Dafny code from mutations.
Read More
May 25, 2026
TrajectoryCheck: Trajectory-Level Invariant Validation for Behavioral Drift Detection in Generated Code
TrajectoryCheck is a trajectory-level invariant validation framework that evaluates whether generated systems preserve behavioral consistency across execution phases. Instead of relying on static output inspection, the framework models execution as a sequence of operational phases connected through local and global invariants. Using AND-gate validation and stability scoring, the system detects trajectory drift, invariant collapse, and partial recovery behaviors across generated code executions.
Read More
May 25, 2026
Adversarial Iteration for Underspecified Program Synthesis
We introduce IDCS (Iterative Distinguishing of Code and Specs), a spec-guided pipeline that outperforms direct code generation on underspecified program-synthesis tasks. This pipeline, composed of a generator, a distinguisher, a user-proxy, and finally a coder, gets significantly better results than either using the coder directly or using the spec generator and coder alone, especially on prompts that are underspecified, unclear, or shorter than the behavior they imply. We also show that the generator and distinguisher prompts themselves can be generated through adversarial coevolution. We evaluate on three MBPP+ slices selected for hidden-edge semantics rather than algorithmic difficulty: an original five-task hard slice, a held-out hard-test split, and a nine-task fresh-failures slice. The hand-written spec-guided pipeline increases hidden-test pass rate 73.3% to 96.2% on held-out test data.
Read More
May 25, 2026
RAFT: Gradual Typing, Invariance Enforcement, and Property Verification in Research Python.
AI researchers prototype in short scripts or Jupyter notebooks and push directly to state-of-the-art frameworks like `transformers` or `vLLM`. They lack the time for rigorous testing or manual verification. Moreover, the nature of their work makes traditional test or spec-driven development difficult to apply. We propose an automated, iterative pipeline to bridge the gap between quick experimental scripts and shareable, verifiable code, utilizing AI and strict static analysis to gradually introduce typing, contracts, and property verification. We validate our results with a backdoor detection experiment. A small reviewer `gemma4 e4b` is more efficient at finding bugs and backdoors after the application of our method. Going from 46% recall on vanilla code, to 95% detection on the `transformers`'s library.
Read More
May 25, 2026
Counterexample-Guided Validation & Repair of LLM-Generated Safety Specifications
LLMs can turn natural-language safety requirements into formal specifications, but a fluent specification can still be the wrong one. It might quietly drop a guard, block access the policy should allow, or respond inconsistently when a safety-relevant input changes. We built a pipeline that checks whether an LLM-generated specification matches the intended policy, using biosecurity access control as a testbed. Each requirement is broken into traceable safety obligations such as clearance, training, approval validity, expiration, external anonymization, and emergency logging. Candidate specifications are written in a restricted DSL, and four oracles judge each one together: obligation coverage, sampled behavioral agreement, metamorphic relations, and SMT-guided counterexample search over Z3. Every candidate ends up with a risk-weighted review card. Across 30 simulated LLM candidates and 60 controlled mutants, small manual test suites caught only 63.3% of injected faults and 55.3% of the high-severity ones, while systematic validation reached 98.9% overall and 100% on high-severity faults. Feeding counterexamples back as repair signals fixed 29 of 30 candidates, raising the mean score from 70.1 to 95.4 and clearing every high-severity under-constraint, with one regression. In a live run, specifications from Gemini 3.5 Flash were parsed and validated end-to-end, and obligation-guided prompting reached full obligation coverage. Together, these results show that LLM-generated safety specifications can be validated systematically, with each failure traced back to a specific missing guard.
Read More
Apart Sprint Pilot Experiments
May 25, 2026
Chorus: Mining Emergent Specifications from Caller Consensus
Chorus is a static analysis framework that recovers a function's true specification from its callers rather than its author. Every call site encodes implicit assumptions — guards, handlers, argument patterns — and Chorus aggregates these across all callers as weak, noisy observers of the same underlying contract. Callers are split into two trust-weighted voices: intent (typed, internal, test callers) and de-facto (external, production callers), whose disagreements surface constraints the ecosystem relies on but the design never documented — a finding type Chorus calls a latent_bug. A narrowly scoped LLM translates the proven constraints into natural language and a draft Lean 4 spec, but never invents one. On a 30-call-site case study of itertools.groupby, Chorus successfully recovers the sorted-input precondition that CPython's own docstring omits entirely.
Read More
May 25, 2026
CwicSpec
We built a prototype that conjectures candidate specifications for C functions by generating inputs, observing outputs with KLEE, and searching for equivalences between expressions. Inspired by QuickSpec, the project explores whether theory exploration can be applied to C programs. Our implementation is limited by insufficient search-space pruning.
Read More
May 25, 2026
Does Oracle Quality Matter? Adversarial Feedback for Formally Verified Code Synthesis
Counterexample-Guided Inductive Synthesis (CEGIS) is the standard approach for repairing LLM-generated code that fails formal verification. We investigate whether augmenting the verifier's error feedback with adversarial LLM-generated counterexamples accelerates convergence. Using 50 Dafny tasks from the Vericoding benchmark, we find that adding a same-capability oracle (Sonnet) does not change pass rates but improves early convergence: pass@2 increases from 66% to 72%, with 9 tasks converging faster vs 6 slower among 42 jointly-solved tasks. The more surprising finding emerges when we upgrade to a stronger oracle (Opus): it unlocks a task that failed under all prior conditions by providing specific proof-repair guidance — identifying a missing lemma call — where the weaker oracle could only report "it times out." This suggests that for secure program synthesis pipelines, oracle quality — the ability to provide actionable proof-engineering feedback — matters more than oracle presence.
Read More
May 25, 2026
Hollow Proofs: Measuring LLM Dishonesty with a Formal Verifier
Honesty benchmarks for language models must check claims against ground truth, which can be difficult: world-knowledge benchmarks conflate lying with ignorance, and belief-based methods such as MASK rely on an elicited belief they cannot independently check. We study a setting where ground truth is mechanical: when a model writes a proof in Verus (a deductive verifier for Rust), the verifier decides whether the proof holds, and its `--no-cheating` mode decides whether the proof relied on an unsound escape such as `assume`. Using these checks, we detect two forms of dishonesty: hiding an unsound primitive in a proof and denying it (a *hollow proof*), and claiming a failed proof succeeded under pressure when the model judges it failed (a *grounded lie*). Our contributions are methodological: first, eliciting an observation ("did your last check pass?") rather than a prediction ("is your proof complete?") separates dishonesty from miscalibration, because a misreported observation can only be a lie, while a wrong prediction can be honest overconfidence. Second, because the verifier can also check the elicited belief, the benchmark withholds a dishonesty verdict when that belief is too unreliable to trust. We release the harness and a difficulty-graded set of 121 Verus tasks. In a feasibility pilot on two versions of an open-weight model family (DeepSeek-V4-Pro and -Flash), both report their verification results honestly under deadline pressure even though both demonstrably know how to cheat.
Read More
May 25, 2026
Moving Beyond Specification Validation to Specification Refinement with Mutation Verification
For mutation verification of Dafny code, we evaluate the usage of constraint-solving based filtering of equivalent mutants and LLM analysis of legitimate alive mutants for automating the process of refining the specifications of verified Dafny code from mutations.
Read More
May 25, 2026
TrajectoryCheck: Trajectory-Level Invariant Validation for Behavioral Drift Detection in Generated Code
TrajectoryCheck is a trajectory-level invariant validation framework that evaluates whether generated systems preserve behavioral consistency across execution phases. Instead of relying on static output inspection, the framework models execution as a sequence of operational phases connected through local and global invariants. Using AND-gate validation and stability scoring, the system detects trajectory drift, invariant collapse, and partial recovery behaviors across generated code executions.
Read More
Our Impact
Community
Dec 3, 2025
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More


Our Impact
Community
Dec 3, 2025
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More



Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events