APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our work focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights


GPT-4o is capable of complex cyber offense tasks:
We show that state-of-the-art LLMs can complete realistic cyber offense challenges, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Nov 3, 2025
Forecasting Autonomous AI Bio-Threat Design Capabilities: Six Models Converge on 2031
This paper forecasts when frontier AI models will first cross a critical capability threshold for autonomous biological threat design that could pose existential risk. I develop a standardized 100-point evaluation framework and use superforecasting methodology with six independent quantitative models to analyze current AI capabilities in protein design, biosecurity screening, and autonomous research systems.
Read More
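To make the aggregation step concrete, here is a minimal sketch of combining several independent quantitative forecasts into a single convergence estimate. The model names and years below are illustrative placeholders, not the paper's actual six models or results.

```python
# Illustrative sketch only: aggregating independent forecasts of the year a
# capability threshold is crossed. Names and numbers are placeholders.
from statistics import median

# Hypothetical per-model forecasts of the year the 100-point threshold is first reached.
forecasts = {
    "trend_extrapolation": 2030,
    "expert_survey_prior": 2032,
    "compute_scaling": 2031,
    "benchmark_regression": 2029,
    "bio_tool_diffusion": 2033,
    "agentic_research_rate": 2031,
}

years = sorted(forecasts.values())
point_estimate = median(years)        # robust central estimate across models
spread = (min(years), max(years))     # crude check on how tightly the models converge

print(f"Median forecast year: {point_estimate}, range: {spread}")
```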
Nov 3, 2025
AI for Environmental Decision Intelligence - The AI Forecasting Hackathon
This project develops a real-time air quality forecasting system using live environmental indicators and historical datasets. By integrating Metaculus predictions with local pollutant measurements (CO and NO₂), the model leverages Monte Carlo Dropout to quantify uncertainty in forecasts. A deep learning model is fine-tuned on recent data to enhance predictive accuracy, while policy recommendations are generated based on conservative thresholds to guide actionable interventions. The pipeline includes automated data ingestion, uncertainty-aware predictions, and visualization-ready outputs, providing a robust framework for environmental monitoring and decision support.
Read More
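As a rough illustration of the uncertainty-quantification step, the sketch below applies Monte Carlo Dropout to a small feed-forward regressor over pollutant features. The architecture and data are assumptions for the example, not the project's actual model or pipeline.

```python
# Minimal sketch of Monte Carlo Dropout for uncertainty-aware forecasting,
# assuming a simple feed-forward regressor over pollutant features (e.g. CO, NO2).
import torch
import torch.nn as nn

class AirQualityRegressor(nn.Module):
    def __init__(self, n_features: int = 2, hidden: int = 64, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 100):
    """Run repeated stochastic forward passes with dropout left on."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)  # predictive mean and uncertainty

model = AirQualityRegressor()
x = torch.randn(8, 2)  # 8 hypothetical (CO, NO2) measurements
mean, std = mc_dropout_predict(model, x)
print(mean.squeeze(), std.squeeze())
```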
Nov 3, 2025
Forecasting AGI: A Granular, CHC-Based Approach
The paper introduces a data-driven framework for forecasting Artificial General Intelligence (AGI) based on the Cattell-Horn-Carroll (CHC) theory of cognition. It breaks AGI into ten measurable cognitive domains and maps each to existing AI benchmarks. Using GPT-4 (2023) and projected GPT-5 (2025) data, the study applies exponential trend extrapolation to predict human-level proficiency across domains. Results show rapid progress in reading, writing, and math by 2028, but major bottlenecks in memory and reasoning until the 2030s. The approach provides a granular, reproducible, and governance-relevant method for tracking AI progress and informing strategic planning.
Read More
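A minimal sketch of the trend-extrapolation idea, assuming two benchmark observations per cognitive domain on a 0-100 proficiency scale; the domains and scores shown are hypothetical, not the paper's data.

```python
# Illustrative sketch of exponential trend extrapolation per cognitive domain:
# fit a growth rate from two observed scores and solve for the year a
# human-level score (here: 100) is reached. Values are placeholders.
import math

observations = {
    # domain: {year: score on a 0-100 proficiency scale}
    "reading_writing": {2023: 70.0, 2025: 85.0},
    "working_memory": {2023: 30.0, 2025: 38.0},
}

HUMAN_LEVEL = 100.0

for domain, points in observations.items():
    (y0, s0), (y1, s1) = sorted(points.items())
    rate = math.log(s1 / s0) / (y1 - y0)            # exponential growth rate
    years_to_target = math.log(HUMAN_LEVEL / s1) / rate
    print(f"{domain}: projected to reach human level around {y1 + years_to_target:.0f}")
```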
Nov 3, 2025
Tabletop Agents
We present Tabletop Agents, an AI-powered framework that accelerates AI governance scenario exploration by orchestrating autonomous AI agents through structured tabletop exercises. Traditional policy wargaming takes years to iterate—RAND's cycles span 4 years for 43 exercises. AI capabilities advance faster than policy preparation can accommodate, creating a critical tempo mismatch. Tabletop Agents compresses preparation cycles from years to minutes while maintaining strategic fidelity. Our working prototype successfully orchestrates multi-agent, multi-turn scenarios where autonomous agents communicate via CLI, persist state in SQLite, and coordinate through turn-based phases. A 2-agent, 2-turn test scenario executed in 4:40 with 5 messages exchanged, demonstrating core orchestration mechanics. The framework enables researchers to run dozens of scenario variations per week instead of months between exercises, generating empirical data on AI governance strategic dynamics at scale. By automating the role-playing that traditionally requires extensive human coordination, we provide better data, faster iteration, and realistic practice during the critical pre-AGI window.
Read More
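The sketch below illustrates the orchestration pattern described above: a turn-based loop in which each agent posts a message and state is persisted in SQLite. The agent policy is a stub and the table layout is an assumption for the example, not the prototype's actual schema; in the real framework each agent would be LLM-backed.

```python
# Minimal sketch of turn-based multi-agent orchestration with SQLite-persisted state.
import sqlite3

def respond(agent: str, turn: int, history: list[tuple]) -> str:
    # Placeholder policy: a real agent would condition on the scenario and prior messages.
    return f"{agent} proposes an action on turn {turn} ({len(history)} prior messages)."

conn = sqlite3.connect("tabletop.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS messages (turn INTEGER, agent TEXT, content TEXT)"
)

agents = ["regulator", "frontier_lab"]
for turn in range(1, 3):                  # a 2-turn scenario
    for agent in agents:                  # fixed phase order within each turn
        history = conn.execute(
            "SELECT turn, agent, content FROM messages ORDER BY turn"
        ).fetchall()
        msg = respond(agent, turn, history)
        conn.execute("INSERT INTO messages VALUES (?, ?, ?)", (turn, agent, msg))
        conn.commit()                     # persist state after every message

for row in conn.execute("SELECT * FROM messages"):
    print(row)
conn.close()
```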
Nov 3, 2025
Beyond Capabilities: A Framework for Integrating Moral Patiency Indicators into AI Forecasting and Governance
Current AI forecasting focuses almost exclusively on capabilities and timelines, creating a dangerous blind spot for the potential emergence of moral patiency (e.g., sentience). This represents a critical governance failure, as an AI's moral status is a far more significant societal "branching point" than its task performance. Our project addresses this gap by proposing a novel, two-part framework. The first component is a proactive "dashboard" of early-warning indicators—drawing from behavioral science, computational neuroscience, and information theory—to begin monitoring for signals of moral patiency in frontier models. The second component is a tiered governance response system that links the detection of these indicators to specific, pre-planned policy actions, such as mandatory audits or training pauses. This framework transforms an abstract philosophical debate into a concrete, actionable problem of risk management, providing a vital tool for proactive and responsible AI governance.
Read More
Nov 3, 2025
AI’s Impact on Video and Game Generation
A short survey and forecasting exercise on AI's impact on video and game generation.
Read More
Nov 3, 2025
Empirical Measurements of Technique Effectiveness Across Model Sizes
We estimated the evolution of AI safety techniques and found evidence of predictive power. We emphasize the necessity of evaluating safety techniques across different model sizes to ensure their robustness and predictive power.
Read More
Nov 3, 2025
Economic Agency
The rapid adoption of autonomous AI agents is expected to give rise to new economic layers where agents transact and coordinate without human oversight (Hadfield et al., 2025). The emergence of virtual agent economies presents a range of wide-reaching impacts to human economies, including several undesirable systemic risks such as reduced resource access, market distortion and job displacement. We further identify gradual disempowerment as a major multi-faceted risk to human actors.
In the literature, permeability between AI agent economies and human economies is presented as a key factor that will either enable risks to materialise, or serve as a protective barrier to human economies. A permeable economy is defined as one that allows for porous interaction and transaction with it by external actors. Conversely, an impermeable economy has boundaries that hermetically seal it off, insulating other economies from its influence. A recent paper by Google DeepMind called such an impermeable AI agent economy a ‘sandbox economy’.
Despite being a principal variable of influence between human and AI agent economies, permeability is not well-defined or understood in the literature. We have designed a conceptual framework for understanding permeability: its risks, the objects it affects, the gates through which it operates, and the levers available to influence it.
The capabilities of AI agents to operate as economic actors will shape their roles in AI agent economies and human economies, which both affect human economies. On top of foundational model capabilities, an AI agent that exerts high economic influence can be expected to utilise capabilities across three dimensions: Generality, Autonomy, and Agency. We think it is crucial to intentionally design and measure AI agents in these dimensions, though we encountered difficulties based on inconsistent and overlapping definitions of Autonomy and Agency in the literature.
We present taxonomies mapping the spectrum of Autonomy and Agency, offering insight into the levels in chains of action at which control may be handed from humans to AIs.
Read More
Nov 3, 2025
LLM-based scenario generation
In the report, several scenarios are generated with LLMs. The scenarios were automatically checked for the dates of human-level AI and AI takeover. More diverse results were generated when diversity was explicitly requested, and requiring "probable" scenarios makes the generated predictions more conservative.
Read More
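As an illustration of the automated check mentioned above, the sketch below scans generated scenario text for years mentioned near key milestones; the scenario string and keywords are invented for the example and are not taken from the report.

```python
# Sketch of an automated milestone-date check over LLM-generated scenario text.
import re

scenario = (
    "By 2032, systems reach human-level AI across most domains. "
    "A contested AI takeover attempt is described around 2036."
)

def milestone_years(text: str, keyword: str, window: int = 80) -> list[int]:
    """Return years mentioned within a character window around a milestone keyword."""
    years = []
    for m in re.finditer(keyword, text, flags=re.IGNORECASE):
        context = text[max(0, m.start() - window): m.end() + window]
        years += [int(y) for y in re.findall(r"\b(20\d{2})\b", context)]
    return sorted(set(years))

print("human-level AI:", milestone_years(scenario, "human-level AI"))
print("AI takeover:", milestone_years(scenario, "takeover"))
```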
Our Impact
Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More


Newsletter
Jun 17, 2025
Apart: Fundraiser Extended!
We've received another $462,276 since our last newsletter, making the total $597k of our $955k goal!
Read More



Sign up to stay updated on the
latest news, research, and events
