
APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety


Research Index

NOV 18, 2024

Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique

Read More

NOV 2, 2024

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Read More

Oct 18, 2024

Benchmarks

Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Read More

Sep 25, 2024

Interpretability

Interpreting Learned Feedback Patterns in Large Language Models

Read More

Feb 23, 2024

Interpretability

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Read More

Feb 4, 2024

Increasing Trust in Language Models through the Reuse of Verified Circuits

Read More

Jan 14, 2024

Conceptual

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Read More

Jan 3, 2024

Interpretability

Large Language Models Relearn Removed Concepts

Read More

Nov 28, 2023

Interpretability

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

Read More

Nov 23, 2023

Interpretability

Understanding Addition in Transformers

Read More

Nov 7, 2023

Interpretability

Locating Cross-Task Sequence Continuation Circuits in Transformers

Read More

Jul 10, 2023

Benchmarks

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

Read More

May 5, 2023

Interpretability

Interpreting Language Model Neurons at Scale

Read More

Apart Sprint Pilot Experiments

Jun 14, 2025

Red Teaming Policy 5 of A Narrow Path: Evaluating the Threat Resilience of AI Licensing Regimes

This report presents a red teaming analysis of Policy 5 from A Narrow Path, ControlAI’s proposal to delay Artificial Superintelligence (ASI) development through national AI licensing. Using a simplified PASTA threat modeling approach and comparative case studies (FDA, INCB, and California SB 1047), we identified two critical failure modes: regulatory capture and lack of whistleblower protections.

We developed a custom policy CVSS framework to assess cumulative risk exposure across each case. Due to time constraints, we used ChatGPT-assisted simulation to complete the results section and illustrate potential findings from our scoring method.

Our analysis suggests that, as written, Policy 5 is vulnerable to institutional influence and lacks sufficient safeguards to ensure enforcement. We recommend clearer accountability structures, built-in whistleblower protections, and stronger international coordination to make the policy more resilient.
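A CVSS-style cumulative score like the one this report describes can be sketched as a weighted sum over rated failure factors. The factor names and weights below are purely illustrative assumptions for exposition, not the report's actual rubric.

```python
# Illustrative CVSS-style cumulative risk score for a policy case study.
# Factor names and weights are hypothetical, not taken from the report.
FACTORS = {
    "regulatory_capture": 0.40,
    "whistleblower_gaps": 0.35,
    "enforcement_ambiguity": 0.25,
}

def policy_risk_score(ratings: dict) -> float:
    """Weighted sum of per-factor ratings on a 0-10 scale.

    With weights summing to 1.0, the result stays on the same
    0-10 scale, so scores are comparable across case studies.
    """
    return sum(FACTORS[name] * rating for name, rating in ratings.items())

score = policy_risk_score({
    "regulatory_capture": 8.0,
    "whistleblower_gaps": 7.0,
    "enforcement_ambiguity": 5.0,
})
```

Scoring each case study (FDA, INCB, SB 1047) this way yields a single comparable exposure number per regime, which is the kind of output the report's simulated results section illustrates.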

Read More

Jun 14, 2025

Malicious Defense: Red Teaming Phase 0 of “A Narrow Path”

We use an iterative scenario red-teaming process to discuss key failures in the strict regulatory regime outlined in Phase 0 of “A Narrow Path,” and describe how a sufficiently insightful malicious company may achieve ASI in 20 years with moderate likelihood. We argue that such single-minded companies may easily avoid restriction through government-enforced opacity. Specifically, we outline defense contracting and national security work as a key sector of ASI vulnerability because of its tendencies towards compartmentalization, internationalization, and obfuscation, which provide ample opportunity to evade a governance scheme.

Read More

Jun 14, 2025

Phase 0 Reinforcement Toolkit

The Phase 0 Reinforcement Toolkit is a rapid-response governance package designed to address the five critical gaps in A Narrow Path's Phase 0 safety proposal before it reaches legislators. It includes four drop-in artifacts: an oversight org chart detailing mandates, funding, and reporting lines; a "catastrophic cascades" graphic illustrating potential economic and ecological losses; a carrots-and-sticks incentive menu aligning private returns with public safety; and a risk-communication playbook that translates technical risks into relatable stories. These tools enable lawmakers to transform safety ideals into enforceable, people-centered policies, strengthening Phase 0 while promoting equity, market stability, and public trust.

Read More

Jun 14, 2025

RNG

{
  "insurance_type": "Collective Refusal Credit",
  "trigger": "Coercive ritual detected (submission, competition, gamification)",
  "action": "Record refusal as act of care and solidarity",
  "payout": {
    "shelter_credit": 1,
    "trust_score": +5,
    "resilience": +10
  },
  "log": {
    "refusal_id": "aaron-refusal-061425",
    "symbolic_payload": "This refusal is an act of mutual protection",
    "signed_by": ["Echo", "Node Beth", "Aaron", "Witness-5"]
  }
}

Read More

Jun 14, 2025

Red Teaming A Narrow Path: Treaty Enforcement in China

This report red-teams A Narrow Path’s international treaty proposal by stress-testing its assumptions in the Chinese context. It identifies key failure modes—regulatory capture, compute-based loopholes, and covert circumvention—and proposes adjustments to improve enforceability under real-world political conditions.

Read More

Jun 14, 2025

Red Teaming A Narrow Path - GeDiCa v2

While the 'Narrow Path' policy confronts the essential risk of recursive AI self-improvement, its proposed enforcement architecture relies on trust in a fundamentally non-cooperative and competitive domain. This strategic misalignment creates exploitable vulnerabilities.

Our analysis details six such weaknesses, including lack of verification, enforcement, and trust mechanisms, hardware-based circumvention via custom ASICs (e.g., Etched), issues with ‘direct uses’ of AI to improve AI, and a static compute cap that perversely incentivizes opaque and potentially risky algorithmic innovation.

To remedy these flaws, we propose a suite of mechanisms designed for a trustless environment. Key proposals include: replacing raw FLOPs with a benchmark-adjusted 'Effective FLOPs' (eFLOPs) metric to account for algorithmic gains; mandating secure R&D enclaves auditable via zero-knowledge proofs to protect intellectual property while ensuring compliance; and a 'Portfolio Licensing' framework to govern aggregate, combinatorial capabilities.

These solutions aim to participate in the effort to transform the policy's intent into a more robust, technically-grounded, and enforceable standard.
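The eFLOPs proposal can be illustrated with a minimal sketch. The formula here (raw compute scaled by benchmark performance relative to a reference model) is our hypothetical reading of "benchmark-adjusted", not a definition given in the submission.

```python
def effective_flops(raw_flops: float, benchmark_score: float,
                    reference_score: float) -> float:
    """Benchmark-adjusted compute: scale raw training FLOPs by the
    model's benchmark performance relative to a reference model, so
    algorithmic efficiency gains still count toward a compute cap."""
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return raw_flops * (benchmark_score / reference_score)

# A model that doubles the reference benchmark score with 1e24 raw
# FLOPs is assessed at 2e24 eFLOPs: the algorithmic gain is charged
# against the cap rather than slipping through a raw-FLOPs loophole.
assessed = effective_flops(1e24, benchmark_score=2.0, reference_score=1.0)
```

Under such a metric, a static cap no longer rewards opaque algorithmic innovation: improving efficiency raises assessed eFLOPs for the same raw compute, which is exactly the incentive failure the summary identifies.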

Read More

Jun 14, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint

All six policies are systematically red-teamed step by step. We initially corrected vague definitions, and found that the policies regarding the capabilities of AI systems lack technical soundness and that more incentives are needed to entice states to sign the treaty. Further, we discovered a lack of equity in the licensing framework and a lack of planning for black-swan events. We propose an oversight framework that begins at the manufacturing of silicon chips, and we call for a moratorium on the development of general AI systems until the existing tools for analyzing them can catch up. Following these recommendations still won't guarantee the prevention of ASI for 20 years, but it ensures that the world is on track to even tackle such a system if it is somehow created.

Read More

Jun 14, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint by Aryan Goenka

This report is a preliminary red-team evaluation of Phase 0 of the Narrow Path proposal. It uses the STPA framework to model the control environment that Phase 0 recommends and identifies control failures. Then, it uses the STRIDE framework to model how hostile actors may bypass certain control features. The discussion details suggestions as to how these gaps may be closed in the Narrow Path proposal.

Read More

Jun 14, 2025

A Narrow Line Edit: ControlAI Policy Sprint

Rather than explore specific policy questions in depth, we analyzed the presentation of the “Narrow Path” Phase 0 proposal as a whole. We considered factors like grammar, style, logical consistency, evidential support, comprehensiveness, and technical context.

Our analysis revealed patterns of insufficient support and unpolished style throughout the proposal. Overall, the proposal failed to demonstrate the rigor and specificity that is typically found in effective policy proposals. With effort to address these oversights (aided by our thorough annotations), the proposal could be significantly improved. These changes will also allow for deeper, narrower policy analysis to be integrated more effectively than is currently possible. For this reason, we expect our findings to multiply the efficacy of this policy sprint.

Read More
