APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights


GPT-4o is capable of complex cyber offense tasks:
We show that realistic cyber offense challenges can be completed by SoTA LLMs, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran. Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities.
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks that are not detected by existing benchmarks.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez. Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark. ACL 2023.
Read More
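
The specificity failure mode highlighted above can be illustrated with a simple check: probe an edited model on prompts unrelated to the edit and count how many of its original predictions change. The sketch below is a hypothetical illustration, not the benchmark from the paper; the model, the prompts, and the placeholder "edited" model are all stand-ins.

```python
# Hypothetical sketch (not the paper's benchmark) of a specificity check for
# model editing: after a factual edit, predictions on unrelated prompts should
# not change. We count how often the greedy next-token prediction differs
# between the original and "edited" model. Model and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def specificity_flip_rate(model_a, model_b, tokenizer, prompts):
    """Fraction of prompts whose greedy next-token prediction differs."""
    flips = 0
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            tok_a = model_a(ids).logits[0, -1].argmax().item()
            tok_b = model_b(ids).logits[0, -1].argmax().item()
        flips += int(tok_a != tok_b)
    return flips / len(prompts)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
original = AutoModelForCausalLM.from_pretrained("gpt2")
# In practice `edited` would come from a model-editing method applied to
# `original`; reloading the base model here is only a runnable placeholder.
edited = AutoModelForCausalLM.from_pretrained("gpt2")

unrelated_prompts = [
    "The capital of France is",
    "Water freezes at a temperature of",
    "The author of Hamlet is",
]
print("flip rate:", specificity_flip_rate(original, edited, tokenizer, unrelated_prompts))
```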
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Sep 15, 2025
When Guardrails Fail: Dual-Use Misuse of AI in Retrosynthesis Through Iterative Refinement–Induced Self-Jailbreaking
Artificial intelligence (AI) is transforming drug discovery, with large language models (LLMs) enabling rapid retrosynthesis planning. Yet these advances also pose dual-use risks, as adversarial prompting can redirect models toward generating harmful pathways. We evaluate Iterative Refinement Induced Self-Jailbreaking (IRIS), showing that while newer models resist more robustly, systems like GPT-4 can be induced to produce stepwise synthesis guidance. This underscores the fragility of guardrails and the urgency of continuous red-teaming. We argue that AI systems in drug discovery should be classified as high-risk under the EU AI Act and propose a severity-based governance framework to proportionately manage jailbreaks while safeguarding biomedical innovation.
Read More
Sep 15, 2025
Policy Brief: Harnessing Open-Source Intelligence for AI Risk Management
Artificial intelligence is advancing amid high uncertainty and diverse risks, ranging from malicious uses such as AI-driven cyberattacks and CBRN threats to failures like hallucinations and systemic impacts on labour markets and privacy. Yet only a small fraction of global AI research addresses safety, creating an evidence dilemma in which regulators must act with limited data or risk being overtaken by sudden capability leaps. Open-source intelligence (OSINT) platforms on AI risk offer a practical solution by aggregating technical documentation, model benchmarks, incident reports, and safety practices to enable continuous, transparent, and shareable risk assessment. Integrating these tools into policymaking, regulatory oversight (e.g., the EU AI Act), national security planning, and public-sector innovation can enhance situational awareness, strengthen compliance, foster public trust, and guide safe AI research and deployment.
Read More
Sep 15, 2025
CBRN-SAFE-Eval: Transparent Escalation Framework
How can we design a transparent, auditable framework for detecting and escalating CBRN-related risks in AI systems that balances real-time threat detection with stakeholder accountability requirements?
Read More
Sep 15, 2025
RobustCBRN Eval: A Practical Benchmark Robustification Toolkit
Current AI safety evaluations for CBRN risks contain systematic vulnerabilities, including statistical pattern exploitation, reproducibility gaps, and transparency trade-offs, which can lead to serious misjudgments about model safety. RobustCBRN Eval addresses these issues with a pipeline that integrates (1) Deep Ignorance consensus detection across diverse models; (2) verified cloze scoring to reduce multiple-choice artifacts; and (3) statistical evaluation with bootstrap confidence intervals for uncertainty quantification. In initial tests on WMDP benchmarks, the system revealed that model accuracy drops from ~66% to ~30% when question stems are removed, confirming that many items can be solved through superficial cues. Cloze-style scoring produced results consistent with full-format questions, and artifact filtering removed 30–40% of exploitable items while reducing the longest-answer heuristic to under 30%. RobustCBRN Eval runs 1,000–3,000 questions in under four hours for less than $300 in compute, with variance under 2% across repeated runs. Key features include a resilient architecture that continues analysis under GPU failure, hash-based anonymization for reproducibility, and confidence-aware evaluation that penalizes overconfident errors. Together, these results demonstrate that RobustCBRN Eval can identify benchmark artifacts, improve robustness checks, and provide reproducible, evidence-based safety evaluations of high-stakes AI models.
Read More
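
As a concrete illustration of the bootstrap-confidence-interval component mentioned in the abstract above, the sketch below computes a nonparametric bootstrap interval over per-question correctness. This is a generic sketch on synthetic data, not the RobustCBRN Eval implementation.

```python
# Generic sketch of the "statistical evaluation with bootstrap confidence
# intervals" component mentioned above: a nonparametric bootstrap CI over
# per-question correctness. Synthetic data; not the RobustCBRN Eval code.
import numpy as np

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Mean accuracy plus a (1 - alpha) bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    # Resample questions with replacement and recompute accuracy each time.
    boots = rng.choice(correct, size=(n_boot, len(correct)), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# Example: synthetic per-question correctness for a 1,000-item benchmark.
scores = np.random.default_rng(1).binomial(1, 0.66, size=1000)
acc, (lo, hi) = bootstrap_accuracy_ci(scores)
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```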
Sep 15, 2025
Towards Agnostic Viral Engineering Detection
Genetic engineering tools could potentially be misused to create harmful pathogens, making early detection of engineered viruses critical for biosecurity. We developed an agnostic genetic engineering detection system for viruses, simulating three types of modifications (deletions, inversions, and frameshift mutations) across 25 human-infecting viral genomes to create a comprehensive synthetic dataset. We benchmarked three classification approaches—k-mer-based logistic regression, BLAST alignment, and convolutional neural networks—alongside an ensemble method. All models achieved F1-scores below 7%, suggesting that standard bioinformatics and machine learning approaches are insufficient for robust detection of diverse viral engineering signatures.
Read More
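
For readers unfamiliar with the k-mer-based logistic regression baseline named above, the sketch below shows one minimal way to set it up with scikit-learn's character n-gram features. The sequences, labels, and choice of k=4 are toy assumptions, not the paper's dataset or configuration.

```python
# Toy sketch of the k-mer logistic-regression baseline named above: represent
# each sequence by counts of length-k substrings (k-mers) and fit a binary
# classifier (engineered vs. unmodified). Sequences, labels, and k=4 are
# illustrative placeholders, not the paper's data or configuration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

sequences = [
    "ATGCGTACGTTAGCCGATCGATCG",  # unmodified (toy)
    "ATGAAAGGTTTCCCGGATCCATGC",  # unmodified (toy)
    "ATGCCCGGGTTTAAACGTACGTAA",  # unmodified (toy)
    "ATGCGTACGGATCGATCGCCGATT",  # simulated modification (toy)
    "ATGAAAGGCCCGGATCCATGCTTT",  # simulated modification (toy)
    "ATGCCCGGGAAACGTACGTAATTT",  # simulated modification (toy)
]
labels = [0, 0, 0, 1, 1, 1]

# Character 4-grams over the raw sequence act as k-mer count features.
kmer_counts = CountVectorizer(analyzer="char", ngram_range=(4, 4), lowercase=False)
model = make_pipeline(kmer_counts, LogisticRegression(max_iter=1000))
model.fit(sequences, labels)

# On real data this would be evaluated on held-out genomes; here we just
# report training F1 on the toy set.
print("training F1:", f1_score(labels, model.predict(sequences)))
```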
Sep 15, 2025
ThoughtTrim
Large language models often generate long chain-of-thought (CoT) traces in which only a small subset of sentences materially influences the final answer. We propose ThoughtTrim: a simple evaluation framework that ranks CoT chunks by counterfactual-importance KL [1], reconstructs prompts using only the top-ranked chunks, and measures the accuracy-retention trade-off as filtering thresholds rise. Using Qwen2.5-1.5B on a 100-question Biology subset of MMLU-Pro, we find that (i) for some questions, KL-guided trimming preserves accuracy at substantial token savings (60–90% on many items), (ii) “first failure” thresholds are heterogeneous: some problems fail immediately, while a long tail remains robust up to aggressive pruning, and (iii) a KL-shuffled control that preserves the number of kept chunks but breaks informativeness is consistently worse than the original selection, demonstrating the value of the ranking signal. We release a lightweight pipeline that uses the counterfactual-importance KL to map these thresholds, efficiency frontiers, and failure distributions. This opens up future work on fine-tuning approaches that make agentic and LLM-based systems more efficient, robust, and deterministic, and therefore safer.
Read More
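
A minimal sketch of the chunk-ranking idea described above: score each chain-of-thought chunk by the KL divergence between the model's answer distribution with the full chain and with that chunk removed, then keep only the top-ranked chunks. This is an assumed reconstruction, not the released ThoughtTrim pipeline; gpt2 stands in for the paper's Qwen2.5-1.5B, and the question and chunks are placeholders.

```python
# Assumed reconstruction (not the released ThoughtTrim code) of the KL-based
# chunk ranking described above. Each chain-of-thought chunk is scored by the
# KL divergence between the answer distribution with the full chain and with
# that chunk removed; only the top-ranked chunks are kept.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for Qwen2.5-1.5B
model = AutoModelForCausalLM.from_pretrained("gpt2")
option_ids = [tok(" " + letter).input_ids[0] for letter in "ABCD"]

def answer_dist(question, chunks):
    """Model's distribution over the option letters given the kept chunks."""
    prompt = question + "\n" + "\n".join(chunks) + "\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1, option_ids]
    return F.softmax(logits, dim=-1)

question = ("Which organelle produces most of a cell's ATP? "
            "A) Nucleus B) Mitochondrion C) Ribosome D) Golgi apparatus")
chunks = ["ATP is the cell's main energy currency.",
          "Oxidative phosphorylation takes place in the mitochondrion.",
          "The nucleus stores the cell's DNA."]

p_full = answer_dist(question, chunks)
# Counterfactual importance of chunk i: KL(p_full || p_without_chunk_i).
importance = [
    F.kl_div(answer_dist(question, chunks[:i] + chunks[i + 1:]).log(),
             p_full, reduction="sum").item()
    for i in range(len(chunks))
]
# Keep the k most important chunks, preserving their original order.
k = 2
top = sorted(range(len(chunks)), key=lambda i: importance[i], reverse=True)[:k]
trimmed = [chunks[i] for i in sorted(top)]
print("importance:", [round(x, 4) for x in importance])
print("trimmed chain:", trimmed)
```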
Sep 15, 2025
Arbiter: Automated Review of Bio-AI Tools for Emerging Risk
The rapid advancement of AI-enabled biological tools presents significant biosecurity and AI safety challenges, necessitating scalable and balanced assessments. Building on the Global Risk Index (GRI) for AI-enabled Biological Tools report's foundational framework, this work introduces Arbiter, an automated pipeline designed to overcome the GRI's resource-intensive manual analysis. Arbiter employs a multi-stage LLM-driven process to analyze scientific literature, systematically evaluating AI-Bio tools for misuse risks and, notably, their potential benefits. This includes assessing economic impact, national competitiveness, and crisis response capabilities, providing policymakers with the comprehensive, balanced insights required for informed decision-making. A pilot execution demonstrated Arbiter's ability to efficiently monitor emerging tools, highlight areas for prompt and model refinement, and support the development of future decision-support frameworks. Arbiter's modular and extensible design empowers users to tailor analyses, ensuring continuous, scalable, and adaptable oversight in this dynamic field.
Read More
Sep 15, 2025
Navigating Safety Measures for Nuclear Nonproliferation: AI-Enabled Early Warning Systems & Governance System For Detecting Nuclear Enrichment
The International Atomic Energy Agency (IAEA) has always sought ways to create a balance between nuclear use for peaceful purposes and the prevention of weapons proliferation. Yet challenges persist in the form of clandestine uranium enrichment, covert plutonium processing, and illicit trade in sensitive technologies. Recent advances in the integration of Artificial Intelligence into nuclear Command, Control and Communication (NC3) systems and procedures could help reduce errors in crisis scenarios, enhance situational awareness, improve surveillance, and increase operational efficiency. However, the introduction of AI systems would also amplify existing challenges and create room for adversarial manipulation, where proliferators intentionally feed misleading signals to evade detection systems and reduce facility footprints. This brief critically examines the vulnerabilities and limitations of existing safeguards, recent case studies such as Iran's suspension of IAEA cooperation, and a multimodal AI-enabled monitoring approach for early detection of nuclear enrichment activities. It concludes with governance approaches to reduce nuclear-related misuse and policy measures to align AI capabilities with non-proliferation norms.
Read More
Sep 15, 2025
Molecules Under Watch: Multi-Modal AI Driven Threat Emergence Detection for Biosecurity
This study presents a comprehensive multi-modal pipeline for assessing biosecurity risks in chemical compounds, integrating real and synthetic datasets from public repositories such as ChEMBL, PubChem, and USPTO patents. The system leverages molecular descriptors extracted via RDKit, contextual embeddings from the Qwen-2.5 language reasoning model, and unsupervised anomaly detection using Isolation Forest to compute a novel Threat Emergence Detection (TED) score. This score quantifies dual-use potential, synthesis feasibility, and novelty, enabling scalable threat triage. We evaluate the pipeline on hybrid datasets, demonstrating robust differentiation between real pharmaceutical compounds and synthetic benchmarks. Our approach advances AI-driven CBRN (Chemical, Biological, Radiological, Nuclear) safety by providing interpretable risk metrics and constitutional oversight, with implications for regulatory compliance and dual-use research governance.
Read More
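
To make the descriptor-plus-anomaly-detection step above concrete, the sketch below computes a few RDKit descriptors per molecule and scores candidates with an Isolation Forest fit on a small reference set. The SMILES strings and descriptor choice are illustrative assumptions, and the output is a generic anomaly score, not the paper's TED score.

```python
# Illustrative sketch of the descriptor-plus-anomaly-detection step described
# above: featurize molecules with a few RDKit descriptors and score candidates
# with an Isolation Forest fit on a reference set. SMILES strings and the
# descriptor choice are placeholders; the output is a generic anomaly score,
# not the paper's TED score.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import IsolationForest

def featurize(smiles: str) -> list[float]:
    """A handful of simple physicochemical descriptors per molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumRotatableBonds(mol),
    ]

# Reference set of common, well-characterized molecules (ethanol, aspirin,
# caffeine, ibuprofen) and two candidates to score against it.
reference = ["CCO", "CC(=O)Oc1ccccc1C(=O)O",
             "CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
candidates = ["CCO", "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"]  # familiar vs. outlier

forest = IsolationForest(random_state=0).fit([featurize(s) for s in reference])
scores = forest.score_samples([featurize(s) for s in candidates])
# Lower (more negative) scores indicate more anomalous compounds.
for smi, score in zip(candidates, scores):
    print(smi, round(float(score), 3))
```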
Our Impact
Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More


Newsletter
Jun 17, 2025
Apart: Fundraiser Extended!
We've received another $462,276 since our last newsletter, making the total $597k of our $955k goal!
Read More



Sign up to stay updated on the latest news, research, and events