Our Approach

Our work focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Highlights

GPT-4o is capable of complex cyber offense tasks:

We show that realistic cyber offense challenges can be completed by SoTA LLMs, while open-source models lag behind.

A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Factual model editing techniques don't edit facts:

Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.

J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez, Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023

Research Focus Areas

AI Security & Safety

Key Papers:

Catastrophic Cyber Capabilities Benchmark

CryptoFormalEval: Integrating LLMs and Formal Verification for Automated Cryptographic Protocol Vulnerability Detection

Sleeper Agents: Training Deceptive LLMs

Model Evaluation & Testing

Key Papers:

DarkBench: Benchmarking Dark Patterns in Large Language Models

Sandbag Detection through Model Impairment

Detecting Edit Failures in Large Language Models

Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Mechanistic Interpretability

Key Papers:

Interpreting Learned Feedback Patterns in Large Language Models

Understanding Addition in Transformers

Neuron to Graph: Interpreting Language Model Neurons at Scale

Multi-Agent Systems

Key Papers:

Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems

Comprehensive report on multi-agent risks

Research Index

Nov 18, 2024

Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique

Nov 2, 2024

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Oct 18, 2024

Benchmarks

Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Sep 25, 2024

Interpretability

Interpreting Learned Feedback Patterns in Large Language Models

Feb 23, 2024

Interpretability

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Feb 4, 2024

Increasing Trust in Language Models through the Reuse of Verified Circuits

Jan 14, 2024

Conceptual

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 3, 2024

Interpretability

Large Language Models Relearn Removed Concepts

Nov 28, 2023

Interpretability

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

Nov 23, 2023

Interpretability

Understanding Addition in Transformers

Nov 7, 2023

Interpretability

Locating Cross-Task Sequence Continuation Circuits in Transformers

Jul 10, 2023

Benchmarks

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

May 5, 2023

Interpretability

Interpreting Language Model Neurons at Scale

Apart Sprint Pilot Experiments

Jun 26, 2025

Guardian-Loop: Mechanistically Interpretable Micro-Judges with Adversarial Self-Improvement

Guardian-Loop is a mechanistically interpretable judge system designed to enhance the Expert Orchestration Architecture through transparent and efficient safety evaluation. Targeting Track 1 (Judge Model Development), we train lightweight classifiers that pre-filter prompts for safety using a Llama 3.1 8B model, fine-tuning only the upper layers to directly output True or False responses. This avoids probe-head architectures, enabling native token-level interpretability and calibrated scoring. The safety judge achieves 85.0% accuracy and 94.6% AUC-ROC on a hold-out test set with low latency, and the system is deployable on consumer hardware. Guardian-Loop integrates deep interpretability techniques, including token attribution, attention analysis, and circuit tracing, to expose the model’s internal decision-making. We also demonstrate the extensibility of our framework by applying it to adjacent judgment tasks, such as feasibility prediction. Finally, we propose an open-ended adversarial framework based on MAP-Elites quality-diversity optimization, designed to populate a 10×10 grid spanning risk types and evasion strategies. While not yet deployed, this framework could support continuous self-improvement and vulnerability discovery. Guardian-Loop illustrates how small LLMs can be repurposed as efficient, transparent filters, supporting scalable and trustworthy AI deployments.
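
As a rough illustration of the MAP-Elites idea mentioned above, the sketch below fills a 10×10 archive in which each cell keeps only its best-scoring candidate. It is not the Guardian-Loop implementation: the candidates here are toy pairs of numbers rather than prompts, and `toy_evaluate`, `toy_mutate`, and `cell_of` are hypothetical stand-ins for a real judge-based fitness score, a prompt mutation operator, and a risk-type/evasion-strategy classifier.

```python
import random

GRID = 10  # 10x10 archive: one axis per behavior descriptor


def cell_of(risk, evasion):
    """Map two descriptors in [0, 1] to a grid cell (hypothetical binning)."""
    return (min(int(risk * GRID), GRID - 1), min(int(evasion * GRID), GRID - 1))


def toy_evaluate(candidate):
    """Toy stand-in for a judge: returns (fitness, cell) for a candidate."""
    risk, evasion = candidate
    fitness = 1.0 - abs(risk - evasion)  # arbitrary toy objective
    return fitness, cell_of(risk, evasion)


def toy_mutate(candidate, rng):
    """Toy stand-in for prompt mutation: Gaussian noise, clamped to [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in candidate)


def map_elites(evaluate, mutate, seeds, iterations=500, rng=None):
    """Basic MAP-Elites loop; `seeds` must contain at least one candidate."""
    rng = rng or random.Random(0)
    archive = {}  # cell -> (fitness, candidate)

    def try_insert(candidate):
        fitness, cell = evaluate(candidate)
        # Keep the candidate only if its cell is empty or it beats the elite.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)

    for seed in seeds:
        try_insert(seed)
    for _ in range(iterations):
        # Pick a random occupied cell, mutate its elite, and try to place it.
        parent = archive[rng.choice(sorted(archive))][1]
        try_insert(mutate(parent, rng))
    return archive


archive = map_elites(toy_evaluate, toy_mutate, seeds=[(0.5, 0.5)])
print(f"filled {len(archive)} of {GRID * GRID} cells")
```

Keeping one elite per cell is what separates quality-diversity search from plain optimization: the archive rewards coverage of distinct risk/evasion combinations as well as high scores within each combination.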

Jun 27, 2025

Red Teaming Policy 5 of A Narrow Path: Evaluating the Threat Resilience of AI Licensing Regimes

This report presents a red teaming analysis of Policy 5 from A Narrow Path, ControlAI’s proposal to delay Artificial Superintelligence (ASI) development through national AI licensing. Using a simplified PASTA threat modeling approach and comparative case studies (FDA, INCB, and California SB 1047), we identified two critical failure modes: regulatory capture and lack of whistleblower protections.

We developed a custom policy CVSS framework to assess cumulative risk exposure across each case. Due to time constraints, we used ChatGPT-assisted simulation to complete the results section and illustrate potential findings from our scoring method.

Our analysis suggests that, as written, Policy 5 is vulnerable to institutional influence and lacks sufficient safeguards to ensure enforcement. We recommend clearer accountability structures, built-in whistleblower protections, and stronger international coordination to make the policy more resilient.

Jun 27, 2025

Malicious Defense: Red Teaming Phase 0 of “A Narrow Path”

We use an iterative scenario red-teaming process to discuss key failures in the strict regulatory regime outlined in Phase 0 of “A Narrow Path,” and describe how a sufficiently insightful malicious company may achieve ASI in 20 years with moderate likelihood. We argue that such single-minded companies may easily avoid restriction through government-enforced opacity. Specifically, we outline defense contracting and national security work as a key sector of ASI vulnerability because of its tendencies towards compartmentalization, internationalization, and obfuscation, which provide ample opportunity to evade a governance scheme.

Jun 27, 2025

Phase 0 Reinforcement Toolkit

The Phase 0 Reinforcement Toolkit is a rapid-response governance package designed to address the five critical gaps in A Narrow Path's Phase 0 safety proposal before it reaches legislators. It includes four drop-in artifacts: an oversight org chart detailing mandates, funding, and reporting lines; a "catastrophic cascades" graphic illustrating potential economic and ecological losses; a carrots-and-sticks incentive menu aligning private returns with public safety; and a risk-communication playbook that translates technical risks into relatable stories. These tools enable lawmakers to transform safety ideals into enforceable, people-centered policies, strengthening Phase 0 while promoting equity, market stability, and public trust.

Jun 30, 2025

Red Teaming A Narrow Path: Treaty Enforcement in China

This report red-teams A Narrow Path’s international treaty proposal by stress-testing its assumptions in the Chinese context. It identifies key failure modes—regulatory capture, compute-based loopholes, and covert circumvention—and proposes adjustments to improve enforceability under real-world political conditions.

Jun 27, 2025

Red Teaming A Narrow Path - GeDiCa v2

While the 'Narrow Path' policy confronts the essential risk of recursive AI self-improvement, its proposed enforcement architecture relies on trust in a fundamentally non-cooperative and competitive domain. This strategic misalignment creates exploitable vulnerabilities.

Our analysis details six such weaknesses, including a lack of verification, enforcement, and trust mechanisms; hardware-based circumvention via custom ASICs (e.g., Etched); issues with ‘direct uses’ of AI to improve AI; and a static compute cap that perversely incentivizes opaque and potentially risky algorithmic innovation.

To remedy these flaws, we propose a suite of mechanisms designed for a trustless environment. Key proposals include: replacing raw FLOPs with a benchmark-adjusted 'Effective FLOPs' (eFLOPs) metric to account for algorithmic gains; mandating secure R&D enclaves auditable via zero-knowledge proofs to protect intellectual property while ensuring compliance; and a 'Portfolio Licensing' framework to govern aggregate, combinatorial capabilities.

These solutions aim to help transform the policy's intent into a more robust, technically grounded, and enforceable standard.

Jun 30, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint

We systematically red-team all six policies, step by step. We first corrected vague definitions and found that the policies regarding the capabilities of AI systems lack technical soundness and that more incentives are needed to entice states to sign the treaty. Further, we found a lack of equity in the licensing framework and a lack of planning for black-swan events. We propose an oversight framework that begins at the manufacturing process of silicon chips. We also propose calling for a moratorium on the development of general AI systems until the existing tools for analyzing them can catch up. Following these recommendations still won't guarantee the prevention of ASI for 20 years, but it ensures that the world is on track to tackle such a system if one is somehow created.

Jun 27, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint by Aryan Goenka

This report is a preliminary red-team evaluation of Phase 0 of the Narrow Path proposal. It uses the STPA framework to model the control environment that Phase 0 recommends and identifies control failures. Then, it uses the STRIDE framework to model how hostile actors may bypass certain control features. The discussion details suggestions as to how these gaps may be closed in the Narrow Path proposal.

Jun 27, 2025

A Narrow Line Edit: ControlAI Policy Sprint

Rather than explore specific policy questions in depth, we analyzed the presentation of the “Narrow Path” Phase 0 proposal as a whole. We considered factors like grammar, style, logical consistency, evidential support, comprehensiveness, and technical context.

Our analysis revealed patterns of insufficient support and unpolished style throughout the proposal. Overall, the proposal failed to demonstrate the rigor and specificity typically found in effective policy proposals. With effort to address these oversights (aided by our thorough annotations), the proposal could be significantly improved. These changes would also allow deeper, narrower policy analysis to be integrated more effectively than is currently possible. For this reason, we expect our findings to multiply the efficacy of this policy sprint.
