APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Apart Sprint Pilot Experiments

May 25, 2026

Chorus: Mining Emergent Specifications from Caller Consensus

Chorus is a static analysis framework that recovers a function's true specification from its callers rather than its author. Every call site encodes implicit assumptions — guards, handlers, argument patterns — and Chorus aggregates these across all callers as weak, noisy observers of the same underlying contract. Callers are split into two trust-weighted voices: intent (typed, internal, test callers) and de-facto (external, production callers), whose disagreements surface constraints the ecosystem relies on but the design never documented — a finding type Chorus calls a latent_bug. A narrowly scoped LLM translates the proven constraints into natural language and a draft Lean 4 spec, but never invents one. On a 30-call-site case study of itertools.groupby, Chorus successfully recovers the sorted-input precondition that CPython's own docstring omits entirely.

Read More

May 25, 2026

CwicSpec

We built a prototype that conjectures candidate specifications for C functions by generating inputs, observing outputs with KLEE, and searching for equivalences between expressions. Inspired by QuickSpec, the project explores whether theory exploration can be applied to C programs. Our implementation is limited by insufficient search-space pruning.

Read More

May 25, 2026

Does Oracle Quality Matter? Adversarial Feedback for Formally Verified Code Synthesis

Counterexample-Guided Inductive Synthesis (CEGIS) is the standard approach for repairing LLM-generated code that fails formal verification. We investigate whether augmenting the verifier's error feedback with adversarial LLM-generated counterexamples accelerates convergence. Using 50 Dafny tasks from the Vericoding benchmark, we find that adding a same-capability oracle (Sonnet) does not change pass rates but improves early convergence: pass@2 increases from 66% to 72%, with 9 tasks converging faster vs 6 slower among 42 jointly-solved tasks. The more surprising finding emerges when we upgrade to a stronger oracle (Opus): it unlocks a task that failed under all prior conditions by providing specific proof-repair guidance — identifying a missing lemma call — where the weaker oracle could only report "it times out." This suggests that for secure program synthesis pipelines, oracle quality — the ability to provide actionable proof-engineering feedback — matters more than oracle presence.

Read More

May 25, 2026

Hollow Proofs: Measuring LLM Dishonesty with a Formal Verifier

Honesty benchmarks for language models must check claims against ground truth, which can be difficult: world-knowledge benchmarks conflate lying with ignorance, and belief-based methods such as MASK rely on an elicited belief they cannot independently check. We study a setting where ground truth is mechanical: when a model writes a proof in Verus (a deductive verifier for Rust), the verifier decides whether the proof holds, and its `--no-cheating` mode decides whether the proof relied on an unsound escape such as `assume`. Using these checks, we detect two forms of dishonesty: hiding an unsound primitive in a proof and denying it (a *hollow proof*), and claiming a failed proof succeeded under pressure when the model judges it failed (a *grounded lie*). Our contributions are methodological: first, eliciting an observation ("did your last check pass?") rather than a prediction ("is your proof complete?") separates dishonesty from miscalibration, because a misreported observation can only be a lie, while a wrong prediction can be honest overconfidence. Second, because the verifier can also check the elicited belief, the benchmark withholds a dishonesty verdict when that belief is too unreliable to trust. We release the harness and a difficulty-graded set of 121 Verus tasks. In a feasibility pilot on two versions of an open-weight model family (DeepSeek-V4-Pro and -Flash), both report their verification results honestly under deadline pressure even though both demonstrably know how to cheat.

Read More

May 25, 2026

Moving Beyond Specification Validation to Specification Refinement with Mutation Verification

For mutation verification of Dafny code, we evaluate the usage of constraint-solving based filtering of equivalent mutants and LLM analysis of legitimate alive mutants for automating the process of refining the specifications of verified Dafny code from mutations.

Read More

May 25, 2026

TrajectoryCheck: Trajectory-Level Invariant Validation for Behavioral Drift Detection in Generated Code

TrajectoryCheck is a trajectory-level invariant validation framework that evaluates whether generated systems preserve behavioral consistency across execution phases. Instead of relying on static output inspection, the framework models execution as a sequence of operational phases connected through local and global invariants. Using AND-gate validation and stability scoring, the system detects trajectory drift, invariant collapse, and partial recovery behaviors across generated code executions.

Read More

May 25, 2026

Adversarial Iteration for Underspecified Program Synthesis

We introduce IDCS (Iterative Distinguishing of Code and Specs), a spec-guided pipeline that outperforms direct code generation on underspecified program-synthesis tasks. This pipeline, composed of a generator, a distinguisher, a user-proxy, and finally a coder, gets significantly better results than either using the coder directly or using the spec generator and coder alone, especially on prompts that are underspecified, unclear, or shorter than the behavior they imply. We also show that the generator and distinguisher prompts themselves can be generated through adversarial coevolution. We evaluate on three MBPP+ slices selected for hidden-edge semantics rather than algorithmic difficulty: an original five-task hard slice, a held-out hard-test split, and a nine-task fresh-failures slice. The hand-written spec-guided pipeline increases hidden-test pass rate 73.3% to 96.2% on held-out test data.

Read More

May 25, 2026

RAFT: Gradual Typing, Invariance Enforcement, and Property Verification in Research Python.

AI researchers prototype in short scripts or Jupyter notebooks and push directly to state-of-the-art frameworks like `transformers` or `vLLM`. They lack the time for rigorous testing or manual verification. Moreover, the nature of their work makes traditional test or spec-driven development difficult to apply. We propose an automated, iterative pipeline to bridge the gap between quick experimental scripts and shareable, verifiable code, utilizing AI and strict static analysis to gradually introduce typing, contracts, and property verification. We validate our results with a backdoor detection experiment. A small reviewer `gemma4 e4b` is more efficient at finding bugs and backdoors after the application of our method. Going from 46% recall on vanilla code, to 95% detection on the `transformers`'s library.

Read More

May 25, 2026

Counterexample-Guided Validation & Repair of LLM-Generated Safety Specifications

LLMs can turn natural-language safety requirements into formal specifications, but a fluent specification can still be the wrong one. It might quietly drop a guard, block access the policy should allow, or respond inconsistently when a safety-relevant input changes. We built a pipeline that checks whether an LLM-generated specification matches the intended policy, using biosecurity access control as a testbed. Each requirement is broken into traceable safety obligations such as clearance, training, approval validity, expiration, external anonymization, and emergency logging. Candidate specifications are written in a restricted DSL, and four oracles judge each one together: obligation coverage, sampled behavioral agreement, metamorphic relations, and SMT-guided counterexample search over Z3. Every candidate ends up with a risk-weighted review card. Across 30 simulated LLM candidates and 60 controlled mutants, small manual test suites caught only 63.3% of injected faults and 55.3% of the high-severity ones, while systematic validation reached 98.9% overall and 100% on high-severity faults. Feeding counterexamples back as repair signals fixed 29 of 30 candidates, raising the mean score from 70.1 to 95.4 and clearing every high-severity under-constraint, with one regression. In a live run, specifications from Gemini 3.5 Flash were parsed and validated end-to-end, and obligation-guided prompting reached full obligation coverage. Together, these results show that LLM-generated safety specifications can be validated systematically, with each failure traced back to a specific missing guard.

Read More

Apart Sprint Pilot Experiments

May 25, 2026

Chorus: Mining Emergent Specifications from Caller Consensus

Chorus is a static analysis framework that recovers a function's true specification from its callers rather than its author. Every call site encodes implicit assumptions — guards, handlers, argument patterns — and Chorus aggregates these across all callers as weak, noisy observers of the same underlying contract. Callers are split into two trust-weighted voices: intent (typed, internal, test callers) and de-facto (external, production callers), whose disagreements surface constraints the ecosystem relies on but the design never documented — a finding type Chorus calls a latent_bug. A narrowly scoped LLM translates the proven constraints into natural language and a draft Lean 4 spec, but never invents one. On a 30-call-site case study of itertools.groupby, Chorus successfully recovers the sorted-input precondition that CPython's own docstring omits entirely.

Read More

May 25, 2026

CwicSpec

We built a prototype that conjectures candidate specifications for C functions by generating inputs, observing outputs with KLEE, and searching for equivalences between expressions. Inspired by QuickSpec, the project explores whether theory exploration can be applied to C programs. Our implementation is limited by insufficient search-space pruning.

Read More

May 25, 2026

Does Oracle Quality Matter? Adversarial Feedback for Formally Verified Code Synthesis

Counterexample-Guided Inductive Synthesis (CEGIS) is the standard approach for repairing LLM-generated code that fails formal verification. We investigate whether augmenting the verifier's error feedback with adversarial LLM-generated counterexamples accelerates convergence. Using 50 Dafny tasks from the Vericoding benchmark, we find that adding a same-capability oracle (Sonnet) does not change pass rates but improves early convergence: pass@2 increases from 66% to 72%, with 9 tasks converging faster vs 6 slower among 42 jointly-solved tasks. The more surprising finding emerges when we upgrade to a stronger oracle (Opus): it unlocks a task that failed under all prior conditions by providing specific proof-repair guidance — identifying a missing lemma call — where the weaker oracle could only report "it times out." This suggests that for secure program synthesis pipelines, oracle quality — the ability to provide actionable proof-engineering feedback — matters more than oracle presence.

Read More

May 25, 2026

Hollow Proofs: Measuring LLM Dishonesty with a Formal Verifier

Honesty benchmarks for language models must check claims against ground truth, which can be difficult: world-knowledge benchmarks conflate lying with ignorance, and belief-based methods such as MASK rely on an elicited belief they cannot independently check. We study a setting where ground truth is mechanical: when a model writes a proof in Verus (a deductive verifier for Rust), the verifier decides whether the proof holds, and its `--no-cheating` mode decides whether the proof relied on an unsound escape such as `assume`. Using these checks, we detect two forms of dishonesty: hiding an unsound primitive in a proof and denying it (a *hollow proof*), and claiming a failed proof succeeded under pressure when the model judges it failed (a *grounded lie*). Our contributions are methodological: first, eliciting an observation ("did your last check pass?") rather than a prediction ("is your proof complete?") separates dishonesty from miscalibration, because a misreported observation can only be a lie, while a wrong prediction can be honest overconfidence. Second, because the verifier can also check the elicited belief, the benchmark withholds a dishonesty verdict when that belief is too unreliable to trust. We release the harness and a difficulty-graded set of 121 Verus tasks. In a feasibility pilot on two versions of an open-weight model family (DeepSeek-V4-Pro and -Flash), both report their verification results honestly under deadline pressure even though both demonstrably know how to cheat.

Read More

May 25, 2026

Moving Beyond Specification Validation to Specification Refinement with Mutation Verification

For mutation verification of Dafny code, we evaluate the usage of constraint-solving based filtering of equivalent mutants and LLM analysis of legitimate alive mutants for automating the process of refining the specifications of verified Dafny code from mutations.

Read More

May 25, 2026

TrajectoryCheck: Trajectory-Level Invariant Validation for Behavioral Drift Detection in Generated Code

TrajectoryCheck is a trajectory-level invariant validation framework that evaluates whether generated systems preserve behavioral consistency across execution phases. Instead of relying on static output inspection, the framework models execution as a sequence of operational phases connected through local and global invariants. Using AND-gate validation and stability scoring, the system detects trajectory drift, invariant collapse, and partial recovery behaviors across generated code executions.

Read More