APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show that realistic cyber offense challenges can be completed by SoTA LLMs, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks that existing benchmarks do not detect.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez, Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Feb 2, 2026
Systematic Cross-Regulation Threat Topology for EU AI Governance
A single frontier AI training run can simultaneously trigger obligations under the EU AI Act, GDPR, Copyright Directive, and NIS, yet no systematic framework maps these compounding regulatory threats across stakeholder types and jurisdictions. We present a systematic threat topology covering 19 EU-level regulations across five stakeholder categories (frontier model developers, deployment platforms, hardware providers, open-source developers, and research organizations), with geographic enforcement modifiers for all 27 Member States. Our methodology employs an activity-based stakeholder taxonomy, temporal activation mapping, and a KNOW/GUESS/UNKNOWN epistemic framework that quantifies regulatory uncertainty rather than obscuring it. Key findings include: (1) cross-regulation compounding creates multiplicative compliance surfaces where identical development activities trigger 3–5 regulatory regimes simultaneously; (2) enforcement concentration, with five DPAs accounting for over 85% of €5.88B in cumulative GDPR fines, creates significant compliance cost differentials depending on establishment jurisdiction; (3) temporal cascading between February 2025 and August 2027 activates obligations under four major regulatory categories in overlapping waves, with August 2025 marking a critical inflection point for frontier AI providers. We release the full 41-page threat matrix as open infrastructure for practitioners navigating EU AI compliance during this implementation period.
Read More
Feb 2, 2026
AEGIS
AEGIS solves the "Trust Deadlock" in global AI governance by enabling regulators to verify model safety without accessing proprietary weights. Acting as a "Digital IAEA," it uses Trusted Execution Environments (TEEs) to perform "blind" inspections that check for 10^25 FLOP compute thresholds and CBRN risks. This protocol generates tamper-proof cryptographic proofs of compliance, shifting international oversight from unreliable self-reporting to a secure, zero-trust technical standard.
Read More
Feb 3, 2026
Maxwell
We built Maxwell, a governance simulator showing that safety and innovation don't have to be a trade-off. By treating governance as a thermodynamic system ('Incentive Engineering'), we modeled a 'Sovereign Subsidy' that keeps permit costs affordable ($0.84) while maintaining 100% security, resolving the 'North Korea vs. Wild West' dilemma.
Read More
Feb 2, 2026
Technical AI Governance via an Agentic Bill of Materials and Risk Tiering
This paper proposes a technical governance framework for agentic AI systems (autonomous agents with tools, memory, and self-directed behaviour), which current regulations do not adequately address. It introduces an Agentic Bill of Materials (ABOM), a machine-readable manifest that documents an agent’s capabilities, autonomy, memory, and safety controls; a quantitative risk scoring formula that computes agent risk from agency, autonomy, persistence, and mitigation factors; and a Unified Agentic Risk Tiering (UART) system that maps agents into five governance tiers aligned with the EU AI Act and international safety standards. Combined with hardware-based attestation, the framework enables verifiable, enforceable AI governance by allowing regulators to cryptographically verify agent configurations rather than relying on self-reported claims, and is validated through an open-source reference implementation.
Read More
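The manifest-plus-scoring pipeline this abstract describes can be pictured with a short, hypothetical Python sketch. Everything below is illustrative only: the ABOM fields, the weights in risk_score, and the tier thresholds in uart_tier are our assumptions, not the paper's actual schema or formula.

```python
from dataclasses import dataclass, field

@dataclass
class AgenticBOM:
    """Toy machine-readable manifest for an agent (fields are illustrative)."""
    name: str
    tools: list = field(default_factory=list)  # external capabilities granted to the agent
    autonomy: float = 0.0      # 0 = human-in-the-loop ... 1 = fully self-directed
    persistence: float = 0.0   # 0 = stateless ... 1 = long-lived memory and goals
    mitigations: float = 0.0   # 0 = none ... 1 = strong sandboxing, monitoring, off-switch

def risk_score(bom: AgenticBOM) -> float:
    """Hypothetical scoring: agency grows with tool count, autonomy and persistence
    add risk, and mitigations discount the total. Weights are placeholders."""
    agency = min(1.0, len(bom.tools) / 10)
    raw = 0.4 * agency + 0.35 * bom.autonomy + 0.25 * bom.persistence
    return raw * (1.0 - 0.5 * bom.mitigations)

def uart_tier(score: float) -> int:
    """Map a score in [0, 1] to one of five governance tiers (thresholds illustrative)."""
    for tier, upper in enumerate((0.2, 0.4, 0.6, 0.8, 1.01), start=1):
        if score < upper:
            return tier
    return 5
```

In this toy version an agent with many tools, high autonomy, and no mitigations lands in the top tiers, which matches the qualitative behaviour the tiering system describes.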
Feb 2, 2026
Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs
We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100% detection, 0% FPR, and perplexity 4.20. Robustness testing confirms detection above 96% even with 30% word replacement. The framework enables O(n) model-free detection, addressing EU AI Act Article 50 requirements. Code available at https://github.com/ChenghengLi/MCLW
Read More
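A minimal sketch of the detection side, assuming a keyed SHA-256 partition of token ids and per-state sets of allowed successor states; this is a reconstruction from the abstract, not the authors' implementation, and the Hoeffding-style threshold is our own simplification.

```python
import hashlib
import math

def token_state(token_id: int, key: bytes, num_states: int) -> int:
    """Map a token id to one of num_states partitions via a keyed SHA-256 hash."""
    digest = hashlib.sha256(key + token_id.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_states

def detection_score(token_ids, key: bytes, allowed_next, num_states: int) -> float:
    """Fraction of adjacent-token transitions consistent with the secret chain.
    allowed_next[s] is the set of states reachable from state s."""
    states = [token_state(t, key, num_states) for t in token_ids]
    if len(states) < 2:
        return 0.0
    hits = sum(1 for a, b in zip(states, states[1:]) if b in allowed_next[a])
    return hits / (len(states) - 1)

def is_watermarked(token_ids, key: bytes, allowed_next, num_states: int, fpr: float = 1e-6) -> bool:
    """One-sided Hoeffding test: unwatermarked text hits an allowed set with
    probability about p0 = |allowed set| / num_states per transition (assuming
    the hash spreads tokens uniformly and allowed sets have equal size)."""
    n = len(token_ids) - 1
    if n <= 0:
        return False
    p0 = len(allowed_next[0]) / num_states
    threshold = p0 + math.sqrt(math.log(1.0 / fpr) / (2 * n))
    return detection_score(token_ids, key, allowed_next, num_states) > threshold
```

Detection in this sketch is model-free and linear in the number of tokens, in line with the O(n) property claimed in the abstract.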
Feb 2, 2026
Prototyping an Embedded Off-Switch for AI Compute
This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.
Read More
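The challenge-response loop described above can be modelled in a few lines of Python. This is only a software analogue of the hardware design: it uses a shared-key HMAC for brevity, whereas a real security block would verify an asymmetric signature against an embedded public key, and the class names and lease length are our own.

```python
import hashlib
import hmac
import os
import time

class SecurityBlock:
    """Software model of the off-switch: outputs pass through only while the
    block holds a fresh, verified authorization."""

    def __init__(self, shared_key: bytes, lease_seconds: float = 60.0):
        self._key = shared_key          # stand-in for the chip's verification key
        self._lease = lease_seconds     # how long one authorization remains valid
        self._expiry = 0.0
        self._pending_nonce = None

    def issue_nonce(self) -> bytes:
        """Step 1: the chip generates a fresh nonce for the external authority."""
        self._pending_nonce = os.urandom(32)
        return self._pending_nonce

    def submit_authorization(self, signature: bytes) -> bool:
        """Step 2: verify the authority's response; grant a time-limited lease."""
        if self._pending_nonce is None:
            return False
        expected = hmac.new(self._key, self._pending_nonce, hashlib.sha256).digest()
        if hmac.compare_digest(signature, expected):
            self._expiry = time.monotonic() + self._lease
            self._pending_nonce = None
            return True
        return False

    def gate_output(self, value: float) -> float:
        """Step 3: without a valid, unexpired authorization, outputs are zeroed."""
        return value if time.monotonic() < self._expiry else 0.0

def authority_sign(key: bytes, nonce: bytes) -> bytes:
    """External authority side: authorize (here, MAC) the chip's nonce."""
    return hmac.new(key, nonce, hashlib.sha256).digest()
```

A run then looks like: the block issues a nonce, the authority signs it out of band, and submit_authorization starts the lease; once the lease lapses, gate_output returns zero until the handshake is repeated.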
Feb 2, 2026
AI Safety Template
A prototype for creating standardized AI safety evaluations that run in a hardened & private way
Read More
Feb 2, 2026
Verification Mechanism Feasibility Scorer (VMFS)
A decision-support framework and dashboard that scores AI verification mechanisms across feasibility dimensions to help policymakers, diplomats, technical AI governance practitioners, and related stakeholders design pragmatic, layered treaties for global AI risks.
Read More
Feb 2, 2026
Domain Ownership Probing
We propose Domain Ownership Probing (DOP), a lightweight verification method that evaluates a model’s internal representation structure instead of its stochastic text outputs. DomainProbe embeds domain-specific statements, forms prototype centroids, and computes domain ownership win-rate and cohesion to assess whether knowledge domains are consistently encoded. By adding an auto-tuned layer search, the method remains effective for both encoder models and decoder-only LLMs, supporting practical AI governance and compliance verification without exposing training datasets.
Read More
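One way to picture the win-rate and cohesion metrics, assuming precomputed statement embeddings; the embedding model, leave-one-out centroids, and cosine scoring below are our assumptions rather than the DomainProbe code.

```python
import numpy as np

def _normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def domain_ownership_scores(embeddings_by_domain: dict) -> dict:
    """embeddings_by_domain maps a domain name to a (num_statements, dim) array
    of statement embeddings (at least two domains assumed).
    Win-rate: fraction of a domain's statements whose cosine similarity to their
    own (leave-one-out) centroid beats every other domain's centroid.
    Cohesion: mean cosine similarity of a domain's statements to that centroid."""
    embs = {d: _normalize(np.asarray(m, dtype=float)) for d, m in embeddings_by_domain.items()}
    centroids = {d: _normalize(m.mean(axis=0)) for d, m in embs.items()}
    results = {}
    for d, matrix in embs.items():
        wins, sims = [], []
        for i, v in enumerate(matrix):
            rest = np.delete(matrix, i, axis=0)
            own_centroid = _normalize(rest.mean(axis=0)) if len(rest) else centroids[d]
            own_sim = float(v @ own_centroid)
            other_sims = [float(v @ centroids[o]) for o in embs if o != d]
            wins.append(own_sim > max(other_sims))
            sims.append(own_sim)
        results[d] = {"win_rate": float(np.mean(wins)), "cohesion": float(np.mean(sims))}
    return results
```

Consistently high win-rates would suggest a domain is coherently encoded; the auto-tuned layer search mentioned in the abstract would repeat a computation like this at each candidate layer and keep the best-separating one.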
Our Impact
Community
Dec 3, 2025
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More



Sign up to stay updated on the
latest news, research, and events
