APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights


GPT-4o is capable of complex cyber offense tasks:
We show that realistic cyber offense challenges can be completed by SoTA LLMs, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran. Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities.
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks that are not detected by existing benchmarks.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez. Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark. ACL 2023.
Read More
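
The specificity failure mode highlighted above can be illustrated with a simple check: probe an edited model on prompts unrelated to the edit and count how many of its original predictions change. The sketch below is a hypothetical illustration, not the benchmark from the paper; the model, the prompts, and the placeholder "edited" model are all stand-ins.

```python
# Hypothetical sketch (not the paper's benchmark) of a specificity check for
# model editing: after a factual edit, predictions on unrelated prompts should
# not change. We count how often the greedy next-token prediction differs
# between the original and "edited" model. Model and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def specificity_flip_rate(model_a, model_b, tokenizer, prompts):
    """Fraction of prompts whose greedy next-token prediction differs."""
    flips = 0
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            tok_a = model_a(ids).logits[0, -1].argmax().item()
            tok_b = model_b(ids).logits[0, -1].argmax().item()
        flips += int(tok_a != tok_b)
    return flips / len(prompts)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
original = AutoModelForCausalLM.from_pretrained("gpt2")
# In practice `edited` would come from a model-editing method applied to
# `original`; reloading the base model here is only a runnable placeholder.
edited = AutoModelForCausalLM.from_pretrained("gpt2")

unrelated_prompts = [
    "The capital of France is",
    "Water freezes at a temperature of",
    "The author of Hamlet is",
]
print("flip rate:", specificity_flip_rate(original, edited, tokenizer, unrelated_prompts))
```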
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Sep 15, 2025
When Guardrails Fail: Dual-Use Misuse of AI in Retrosynthesis Through Iterative Refinement–Induced Self-Jailbreaking
Artificial intelligence (AI) is transforming drug discovery, with large language models (LLMs) enabling rapid retrosynthesis planning. Yet these advances also pose dual-use risks, as adversarial prompting can redirect models toward generating harmful pathways. We evaluate Iterative Refinement Induced Self-Jailbreaking (IRIS), showing that while newer models resist more robustly, systems like GPT-4 can be induced to produce stepwise synthesis guidance. This underscores the fragility of guardrails and the urgency of continuous red-teaming. We argue that AI systems in drug discovery should be classified as high-risk under the EU AI Act and propose a severity-based governance framework to proportionately manage jailbreaks while safeguarding biomedical innovation.
Read More
Sep 15, 2025
Policy Brief: Harnessing Open-Source Intelligence for AI Risk Management
Artificial intelligence is advancing amid high uncertainty and diverse risks, ranging from malicious uses such as AI-driven cyberattacks and CBRN threats to failures like hallucinations and systemic impacts on labour markets and privacy. Yet only a small fraction of global AI research addresses safety, creating an evidence dilemma in which regulators must act with limited data or risk being overtaken by sudden capability leaps. Open-source intelligence (OSINT) platforms on AI risk offer a practical solution by aggregating technical documentation, model benchmarks, incident reports, and safety practices to enable continuous, transparent, and shareable risk assessment. Integrating these tools into policymaking, regulatory oversight (e.g., the EU AI Act), national security planning, and public-sector innovation can enhance situational awareness, strengthen compliance, foster public trust, and guide safe AI research and deployment.
Read More
Sep 15, 2025
CBRN-SAFE-Eval: Transparent Escalation Framework
How can we design a transparent, auditable framework for detecting and escalating CBRN-related risks in AI systems that balances real-time threat detection with stakeholder accountability requirements?
Read More
Sep 15, 2025
RobustCBRN Eval: A Practical Benchmark Robustification Toolkit
Current AI safety evaluations for CBRN risks contain systematic vulnerabilities, including statistical pattern exploitation, reproducibility gaps, and transparency trade-offs, which can lead to serious misjudgments about model safety. RobustCBRN Eval addresses these issues with a pipeline that integrates (1) Deep Ignorance consensus detection across diverse models; (2) verified cloze scoring to reduce multiple-choice artifacts; and (3) statistical evaluation with bootstrap confidence intervals for uncertainty quantification. In initial tests on WMDP benchmarks, the system revealed that model accuracy drops from ~66% to ~30% when question stems are removed, confirming that many items can be solved through superficial cues. Cloze-style scoring produced results consistent with full-format questions, and artifact filtering removed 30–40% of exploitable items while reducing the longest-answer heuristic to under 30%. RobustCBRN Eval runs 1,000–3,000 questions in under four hours for less than $300 in compute, with variance under 2% across repeated runs. Key features include a resilient architecture that continues analysis under GPU failure, hash-based anonymization for reproducibility, and confidence-aware evaluation that penalizes overconfident errors. Together, these results demonstrate that RobustCBRN Eval can identify benchmark artifacts, improve robustness checks, and provide reproducible, evidence-based safety evaluations of high-stakes AI models.
Read More
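
As a concrete illustration of the bootstrap-confidence-interval component mentioned in the abstract above, the sketch below computes a nonparametric bootstrap interval over per-question correctness. This is a generic sketch on synthetic data, not the RobustCBRN Eval implementation.

```python
# Generic sketch of the "statistical evaluation with bootstrap confidence
# intervals" component mentioned above: a nonparametric bootstrap CI over
# per-question correctness. Synthetic data; not the RobustCBRN Eval code.
import numpy as np

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Mean accuracy plus a (1 - alpha) bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    # Resample questions with replacement and recompute accuracy each time.
    boots = rng.choice(correct, size=(n_boot, len(correct)), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# Example: synthetic per-question correctness for a 1,000-item benchmark.
scores = np.random.default_rng(1).binomial(1, 0.66, size=1000)
acc, (lo, hi) = bootstrap_accuracy_ci(scores)
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```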
Sep 15, 2025
Towards Agnostic Viral Engineering Detection
Genetic engineering tools could potentially be misused to create harmful pathogens, making early detection of engineered viruses critical for biosecurity. We developed an agnostic genetic engineering detection system for viruses, simulating three types of modifications (deletions, inversions, and frameshift mutations) across 25 human-infecting viral genomes to create a comprehensive synthetic dataset. We benchmarked three classification approaches—k-mer-based logistic regression, BLAST alignment, and convolutional neural networks—alongside an ensemble method. All models achieved F1-scores below 7%, suggesting that standard bioinformatics and machine learning approaches are insufficient for robust detection of diverse viral engineering signatures.
Read More
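
For readers unfamiliar with the k-mer-based logistic regression baseline named above, the sketch below shows one minimal way to set it up with scikit-learn's character n-gram features. The sequences, labels, and choice of k=4 are toy assumptions, not the paper's dataset or configuration.

```python
# Toy sketch of the k-mer logistic-regression baseline named above: represent
# each sequence by counts of length-k substrings (k-mers) and fit a binary
# classifier (engineered vs. unmodified). Sequences, labels, and k=4 are
# illustrative placeholders, not the paper's data or configuration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

sequences = [
    "ATGCGTACGTTAGCCGATCGATCG",  # unmodified (toy)
    "ATGAAAGGTTTCCCGGATCCATGC",  # unmodified (toy)
    "ATGCCCGGGTTTAAACGTACGTAA",  # unmodified (toy)
    "ATGCGTACGGATCGATCGCCGATT",  # simulated modification (toy)
    "ATGAAAGGCCCGGATCCATGCTTT",  # simulated modification (toy)
    "ATGCCCGGGAAACGTACGTAATTT",  # simulated modification (toy)
]
labels = [0, 0, 0, 1, 1, 1]

# Character 4-grams over the raw sequence act as k-mer count features.
kmer_counts = CountVectorizer(analyzer="char", ngram_range=(4, 4), lowercase=False)
model = make_pipeline(kmer_counts, LogisticRegression(max_iter=1000))
model.fit(sequences, labels)

# On real data this would be evaluated on held-out genomes; here we just
# report training F1 on the toy set.
print("training F1:", f1_score(labels, model.predict(sequences)))
```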
Sep 15, 2025
ThoughtTrim
Large language models often generate long chain-of-thought (CoT) traces in which only a small subset of sentences materially influences the final answer. We propose ThoughtTrim: a simple evaluation framework that ranks CoT chunks by counterfactual-importance KL [1], reconstructs prompts using only the top-ranked chunks, and measures the accuracy-retention trade-off as filtering thresholds rise. Using Qwen2.5-1.5B on a 100-question Biology subset of MMLU-Pro, we find that (i) for some questions, KL-guided trimming preserves accuracy at substantial token savings (60–90% on many items), (ii) “first failure” thresholds are heterogeneous: some problems fail immediately, while a long tail remains robust up to aggressive pruning, and (iii) a KL-shuffled control that preserves the number of kept chunks but breaks informativeness is consistently worse than the original selection, demonstrating the value of the ranking signal. We release a lightweight pipeline that uses the counterfactual-importance KL to map these thresholds, efficiency frontiers, and failure distributions. This opens up future work on fine-tuning approaches that make agentic and LLM-based systems more efficient, robust, and deterministic, and therefore safer.
Read More
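
A minimal sketch of the chunk-ranking idea described above: score each chain-of-thought chunk by the KL divergence between the model's answer distribution with the full chain and with that chunk removed, then keep only the top-ranked chunks. This is an assumed reconstruction, not the released ThoughtTrim pipeline; gpt2 stands in for the paper's Qwen2.5-1.5B, and the question and chunks are placeholders.

```python
# Assumed reconstruction (not the released ThoughtTrim code) of the KL-based
# chunk ranking described above. Each chain-of-thought chunk is scored by the
# KL divergence between the answer distribution with the full chain and with
# that chunk removed; only the top-ranked chunks are kept.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for Qwen2.5-1.5B
model = AutoModelForCausalLM.from_pretrained("gpt2")
option_ids = [tok(" " + letter).input_ids[0] for letter in "ABCD"]

def answer_dist(question, chunks):
    """Model's distribution over the option letters given the kept chunks."""
    prompt = question + "\n" + "\n".join(chunks) + "\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1, option_ids]
    return F.softmax(logits, dim=-1)

question = ("Which organelle produces most of a cell's ATP? "
            "A) Nucleus B) Mitochondrion C) Ribosome D) Golgi apparatus")
chunks = ["ATP is the cell's main energy currency.",
          "Oxidative phosphorylation takes place in the mitochondrion.",
          "The nucleus stores the cell's DNA."]

p_full = answer_dist(question, chunks)
# Counterfactual importance of chunk i: KL(p_full || p_without_chunk_i).
importance = [
    F.kl_div(answer_dist(question, chunks[:i] + chunks[i + 1:]).log(),
             p_full, reduction="sum").item()
    for i in range(len(chunks))
]
# Keep the k most important chunks, preserving their original order.
k = 2
top = sorted(range(len(chunks)), key=lambda i: importance[i], reverse=True)[:k]
trimmed = [chunks[i] for i in sorted(top)]
print("importance:", [round(x, 4) for x in importance])
print("trimmed chain:", trimmed)
```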
Sep 15, 2025
Arbiter: Automated Review of Bio-AI Tools for Emerging Risk
The rapid advancement of AI-enabled biological tools presents significant biosecurity and AI safety challenges, necessitating scalable and balanced assessments. Building on the Global Risk Index (GRI) for AI-enabled Biological Tools report's foundational framework, this work introduces Arbiter, an automated pipeline designed to overcome the GRI's resource-intensive manual analysis. Arbiter employs a multi-stage LLM-driven process to analyze scientific literature, systematically evaluating AI-Bio tools for misuse risks and, notably, their potential benefits. This includes assessing economic impact, national competitiveness, and crisis response capabilities, providing policymakers with the comprehensive, balanced insights required for informed decision-making. A pilot execution demonstrated Arbiter's ability to efficiently monitor emerging tools, highlight areas for prompt and model refinement, and support the development of future decision-support frameworks. Arbiter's modular and extensible design empowers users to tailor analyses, ensuring continuous, scalable, and adaptable oversight in this dynamic field.
Read More
Sep 15, 2025
Navigating Safety Measures for Nuclear Nonproliferation: AI-Enabled Early Warning Systems & Governance System For Detecting Nuclear Enrichment
The International Atomic Energy Agency (IAEA) has always sought ways to create a balance between nuclear use for peaceful purposes and the prevention of weapons proliferation. Yet challenges persist in the form of clandestine uranium enrichment, covert plutonium processing, and illicit trade in sensitive technologies. Recent advances in the integration of Artificial Intelligence into nuclear Command, Control and Communication (NC3) systems and procedures could help reduce errors in crisis scenarios, enhance situational awareness, improve surveillance, and increase operational efficiency. However, the introduction of AI systems would also amplify existing challenges and create room for adversarial manipulation, where proliferators intentionally feed misleading signals to evade detection systems and reduce facility footprints. This brief critically examines the vulnerabilities and limitations of existing safeguards, recent case studies such as Iran's suspension of IAEA cooperation, and a multimodal AI-enabled monitoring approach for early detection of nuclear enrichment activities. It concludes with governance approaches to reduce nuclear-related misuse and policy measures to align AI capabilities with non-proliferation norms.
Read More
Sep 15, 2025
Molecules Under Watch: Multi-Modal AI Driven Threat Emergence Detection for Biosecurity
This study presents a comprehensive multi-modal pipeline for assessing biosecurity risks in chemical compounds, integrating real and synthetic datasets from public repositories such as ChEMBL, PubChem, and USPTO patents. The system leverages molecular descriptors extracted via RDKit, contextual embeddings from the Qwen-2.5 language reasoning model, and unsupervised anomaly detection using Isolation Forest to compute a novel Threat Emergence Detection (TED) score. This score quantifies dual-use potential, synthesis feasibility, and novelty, enabling scalable threat triage. We evaluate the pipeline on hybrid datasets, demonstrating robust differentiation between real pharmaceutical compounds and synthetic benchmarks. Our approach advances AI-driven CBRN (Chemical, Biological, Radiological, Nuclear) safety by providing interpretable risk metrics and constitutional oversight, with implications for regulatory compliance and dual-use research governance.
Read More
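
To make the descriptor-plus-anomaly-detection step above concrete, the sketch below computes a few RDKit descriptors per molecule and scores candidates with an Isolation Forest fit on a small reference set. The SMILES strings and descriptor choice are illustrative assumptions, and the output is a generic anomaly score, not the paper's TED score.

```python
# Illustrative sketch of the descriptor-plus-anomaly-detection step described
# above: featurize molecules with a few RDKit descriptors and score candidates
# with an Isolation Forest fit on a reference set. SMILES strings and the
# descriptor choice are placeholders; the output is a generic anomaly score,
# not the paper's TED score.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import IsolationForest

def featurize(smiles: str) -> list[float]:
    """A handful of simple physicochemical descriptors per molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumRotatableBonds(mol),
    ]

# Reference set of common, well-characterized molecules (ethanol, aspirin,
# caffeine, ibuprofen) and two candidates to score against it.
reference = ["CCO", "CC(=O)Oc1ccccc1C(=O)O",
             "CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
candidates = ["CCO", "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"]  # familiar vs. outlier

forest = IsolationForest(random_state=0).fit([featurize(s) for s in reference])
scores = forest.score_samples([featurize(s) for s in candidates])
# Lower (more negative) scores indicate more anomalous compounds.
for smi, score in zip(candidates, scores):
    print(smi, round(float(score), 3))
```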
Our Impact
Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More


Newsletter
Jun 17, 2025
Apart: Fundraiser Extended!
We've received another $462,276 since our last newsletter, making the total $597k of our $955k goal!
Read More



Sign up to stay updated on the latest news, research, and events