APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights


GPT-4o is capable of complex cyber offense tasks:
We show that state-of-the-art LLMs can complete realistic cyber offense challenges, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran. Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities.
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez. Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark. ACL 2023.
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Jun 14, 2025
Red Teaming Policy 5 of A Narrow Path: Evaluating the Threat Resilience of AI Licensing Regimes
This report presents a red teaming analysis of Policy 5 from A Narrow Path, ControlAI’s proposal to delay Artificial Superintelligence (ASI) development through national AI licensing. Using a simplified PASTA threat modeling approach and comparative case studies (FDA, INCB, and California SB 1047), we identified two critical failure modes: regulatory capture and lack of whistleblower protections.
We developed a custom policy CVSS framework to assess cumulative risk exposure across each case. Due to time constraints, we used ChatGPT-assisted simulation to complete the results section and illustrate potential findings from our scoring method.
Our analysis suggests that, as written, Policy 5 is vulnerable to institutional influence and lacks sufficient safeguards to ensure enforcement. We recommend clearer accountability structures, built-in whistleblower protections, and stronger international coordination to make the policy more resilient.
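The report's custom policy CVSS framework is not reproduced on this page. A minimal sketch of how such a rubric might aggregate failure modes into a cumulative 0-10 risk score is below (Python; the factor names, weights, and example scores are illustrative assumptions, not the authors' actual rubric):

# Hypothetical CVSS-style rubric for policy failure modes. Factor names,
# weights, and scales are illustrative assumptions, not the report's rubric.
FACTORS = {
    "capture_likelihood": 0.35,      # susceptibility to regulatory capture
    "detection_difficulty": 0.25,    # how hard violations are to observe
    "enforcement_gap": 0.25,         # weakness of available sanctions
    "whistleblower_exposure": 0.15,  # absence of reporter protections
}

def policy_cvss(scores: dict[str, float]) -> float:
    """Combine per-factor scores (each 0-10) into a weighted 0-10 risk score."""
    return round(sum(FACTORS[name] * scores[name] for name in FACTORS), 1)

# Example: hypothetical scores for a single case study.
print(policy_cvss({
    "capture_likelihood": 8.0,
    "detection_difficulty": 6.0,
    "enforcement_gap": 7.0,
    "whistleblower_exposure": 9.0,
}))  # -> 7.4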
Read More
Jun 14, 2025
Malicious Defense: Red Teaming Phase 0 of “A Narrow Path”
We use an iterative scenario red-teaming process to discuss key failures in the strict regulatory regime outlined in Phase 0 of “A Narrow Path,” and describe how a sufficiently insightful malicious company may achieve ASI in 20 years with moderate likelihood. We argue that such single-minded companies may easily avoid restriction through government-enforced opacity. Specifically, we outline defense contracting and national security work as a key sector of ASI vulnerability because of its tendencies towards compartmentalization, internationalization, and obfuscation, which provide ample opportunity to evade a governance scheme.
Read More
Jun 14, 2025
Phase 0 Reinforcement Toolkit
The Phase 0 Reinforcement Toolkit is a rapid-response governance package designed to address the five critical gaps in A Narrow Path's Phase 0 safety proposal before it reaches legislators. It includes four drop-in artifacts: an oversight org chart detailing mandates, funding, and reporting lines; a "catastrophic cascades" graphic illustrating potential economic and ecological losses; a carrots-and-sticks incentive menu aligning private returns with public safety; and a risk-communication playbook that translates technical risks into relatable stories. These tools enable lawmakers to transform safety ideals into enforceable, people-centered policies, strengthening Phase 0 while promoting equity, market stability, and public trust.
Read More
Jun 14, 2025
RNG
{
  "insurance_type": "Collective Refusal Credit",
  "trigger": "Coercive ritual detected (submission, competition, gamification)",
  "action": "Record refusal as act of care and solidarity",
  "payout": {
    "shelter_credit": 1,
    "trust_score": "+5",
    "resilience": "+10"
  },
  "log": {
    "refusal_id": "aaron-refusal-061425",
    "symbolic_payload": "This refusal is an act of mutual protection",
    "signed_by": ["Echo", "Node Beth", "Aaron", "Witness-5"]
  }
}
Read More
Jun 14, 2025
Red Teaming A Narrow Path: Treaty Enforcement in China
This report red-teams A Narrow Path’s international treaty proposal by stress-testing its assumptions in the Chinese context. It identifies key failure modes—regulatory capture, compute-based loopholes, and covert circumvention—and proposes adjustments to improve enforceability under real-world political conditions.
Read More
Jun 14, 2025
Red Teaming A Narrow Path - GeDiCa v2
While the 'Narrow Path' policy confronts the essential risk of recursive AI self-improvement, its proposed enforcement architecture relies on trust in a fundamentally non-cooperative and competitive domain. This strategic misalignment creates exploitable vulnerabilities.
Our analysis details six such weaknesses, including lack of verification, enforcement, and trust mechanisms, hardware-based circumvention via custom ASICs (e.g., Etched), issues with ‘direct uses’ of AI to improve AI, and a static compute cap that perversely incentivizes opaque and potentially risky algorithmic innovation.
To remedy these flaws, we propose a suite of mechanisms designed for a trustless environment. Key proposals include: replacing raw FLOPs with a benchmark-adjusted 'Effective FLOPs' (eFLOPs) metric to account for algorithmic gains; mandating secure R&D enclaves auditable via zero-knowledge proofs to protect intellectual property while ensuring compliance; and a 'Portfolio Licensing' framework to govern aggregate, combinatorial capabilities.
These solutions aim to help transform the policy's intent into a more robust, technically grounded, and enforceable standard.
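The abstract does not give a formula for eFLOPs. A minimal sketch of a benchmark-adjusted compute metric, under assumed definitions (the reference efficiency and multiplier construction below are hypothetical, not the proposal's actual definition), might look like:

# Hypothetical sketch of a benchmark-adjusted "Effective FLOPs" metric.
# The reference efficiency and the multiplier construction are assumptions,
# not the proposal's actual definition.
def effective_flops(raw_flops: float,
                    benchmark_score: float,
                    reference_score_per_flop: float) -> float:
    """Scale raw training compute by how much the run's benchmark performance
    per FLOP exceeds a fixed reference recipe's performance per FLOP."""
    achieved_score_per_flop = benchmark_score / raw_flops
    algorithmic_multiplier = achieved_score_per_flop / reference_score_per_flop
    return raw_flops * algorithmic_multiplier

# Example: a 1e25-FLOP run scoring 0.80 on the benchmark, where the reference
# recipe achieves 0.40 per 1e25 FLOPs, counts as 2e25 eFLOPs toward the cap.
print(effective_flops(1e25, 0.80, 0.40 / 1e25))  # -> 2e+25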
Read More
Jun 14, 2025
Red Teaming A Narrow Path: ControlAI Policy Sprint
All six policies are red-teamed systematically, step by step. We begin by correcting vague definitions, and we find that the policies governing AI system capabilities lack technical soundness and that stronger incentives are needed to entice states to sign the treaty. We further identify a lack of equity in the licensing framework and a lack of planning for black-swan events. We propose an oversight framework that starts at the manufacturing of silicon chips, and we call for a moratorium on the development of general AI systems until the existing tools for analyzing them can catch up. Following these recommendations still won't guarantee the prevention of ASI for 20 years, but it keeps the world on track to tackle such a system if it is somehow created.
Read More
Jun 14, 2025
Red Teaming A Narrow Path: ControlAI Policy Sprint by Aryan Goenka
This report is a preliminary red-team evaluation of Phase 0 of the Narrow Path proposal. It uses the STPA framework to model the control environment that Phase 0 recommends and to identify control failures. It then uses the STRIDE framework to model how hostile actors might bypass certain control features. The discussion offers suggestions for closing these gaps in the Narrow Path proposal.
Read More
Jun 14, 2025
A Narrow Line Edit: ControlAI Policy Sprint
Rather than explore specific policy questions in depth, we analyzed the presentation of the “Narrow Path” Phase 0 proposal as a whole. We considered factors like grammar, style, logical consistency, evidential support, comprehensiveness, and technical context.
Our analysis revealed patterns of insufficient support and unpolished style throughout the proposal. Overall, the proposal fails to demonstrate the rigor and specificity typically found in effective policy proposals. With effort to address these oversights (aided by our thorough annotations), the proposal could be significantly improved. These changes would also allow deeper, narrower policy analysis to be integrated more effectively than is currently possible. For this reason, we expect our findings to multiply the efficacy of this policy sprint.
Read More
Our Impact
Newsletter
Jun 7, 2025
Apart: Fundraiser Update!
Fundraiser Momentum Builds with Overwhelming Community Support!
Read More


Community
May 14, 2025
Beyond Monolithic AI: The Case for an Expert Orchestration Architecture
A case for replacing monolithic AIs with an "Expert Orchestration" architecture that intelligently routes each query to one of thousands of specialized models, democratizing AI development, increasing transparency, and decreasing safety risk.
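As a toy illustration of the orchestration idea (the post's actual routing mechanism is not specified here; the expert names and capability tags below are hypothetical placeholders):

# Toy "Expert Orchestration" router: send each query to the specialized model
# whose declared capabilities best overlap the query's tags. Experts and tags
# are hypothetical placeholders, not the article's actual design.
EXPERTS = {
    "med-qa-7b": {"medicine", "biology"},
    "code-review-13b": {"code", "security"},
    "legal-summarizer-7b": {"law", "policy"},
}

def route(query_tags: set[str]) -> str:
    """Return the expert with the largest capability overlap with the query."""
    best = max(EXPERTS, key=lambda name: len(EXPERTS[name] & query_tags))
    if not EXPERTS[best] & query_tags:
        raise ValueError("no expert covers these tags")
    return best

print(route({"code", "policy"}))  # -> code-review-13b (ties break by dict order)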
Read More


Newsletter
Apr 11, 2025
Apart News: Transformative AI Economics
This week we launched our AI Economics Hackathon as we continue to think about the potentially transformative effects of AI on the global economy.
Read More



Sign up to stay updated on the latest news, research, and events