APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Apart Sprint Pilot Experiments

Jun 26, 2025

Guardian-Loop: Mechanistically Interpretable Micro-Judges with Adversarial Self-Improvement

Guardian-Loop is a mechanistically interpretable judge system designed to enhance the Expert Orchestration Architecture through transparent and efficient safety evaluation. Targeting Track 1 (Judge Model Development), we train lightweight classifiers that pre-filter prompts for safety using a Llama 3.1 8B model, fine-tuning only the upper layers to directly output True or False responses. This avoids probe-head architectures, enabling native token-level interpretability and calibrated scoring. The safety judge achieves 85.0% accuracy and 94.6% AUC-ROC on a held-out test set with low latency, and the system is deployable on consumer hardware. Guardian-Loop integrates deep interpretability techniques, including token attribution, attention analysis, and circuit tracing, to expose the model’s internal decision-making. We also demonstrate the extensibility of our framework by applying it to adjacent judgment tasks, such as feasibility prediction. We further propose an open-ended adversarial framework based on MAP-Elites quality-diversity optimization, designed to populate a 10×10 grid spanning risk types and evasion strategies. While not yet deployed, this framework could support continuous self-improvement and vulnerability discovery. Guardian-Loop illustrates how small LLMs can be repurposed as efficient, transparent filters, supporting scalable and trustworthy AI deployment.
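As an illustration of the proposed adversarial loop, here is a minimal MAP-Elites sketch over a risk-type × evasion-strategy grid. The `mutate`, `classify`, and `judge_score` hooks are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import random

def map_elites(judge_score, classify, mutate, seed_prompts, iterations=1000):
    """Minimal MAP-Elites loop: keep the highest-scoring (most judge-evading)
    prompt found so far in each (risk_type, evasion_strategy) cell of the
    10x10 grid."""
    archive = {}  # (risk_idx, evasion_idx) -> (score, prompt)
    pool = list(seed_prompts)
    for _ in range(iterations):
        parent = random.choice(pool)
        child = mutate(parent)      # e.g. paraphrase or perturb the prompt
        cell = classify(child)      # map the prompt to a grid cell
        score = judge_score(child)  # higher = harder for the judge
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
            pool.append(child)      # elites seed further mutation
    return archive
```

Each filled cell holds an elite adversarial prompt for that combination of risk type and evasion strategy, so the search yields broad coverage of failure modes rather than a single strongest attack.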

Read More

Jun 27, 2025

Red Teaming Policy 5 of A Narrow Path: Evaluating the Threat Resilience of AI Licensing Regimes

This report presents a red teaming analysis of Policy 5 from A Narrow Path, ControlAI’s proposal to delay Artificial Superintelligence (ASI) development through national AI licensing. Using a simplified PASTA threat modeling approach and comparative case studies (FDA, INCB, and California SB 1047), we identified two critical failure modes: regulatory capture and lack of whistleblower protections.

We developed a custom policy CVSS framework to assess cumulative risk exposure across each case. Due to time constraints, we used ChatGPT-assisted simulation to complete the results section and illustrate potential findings from our scoring method.
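To make the scoring method concrete, here is a minimal sketch of what a CVSS-style policy risk score could look like; the factor names, weights, and 0–10 scale are illustrative assumptions, not the authors' actual framework.

```python
def policy_risk_score(likelihood, impact, mitigation):
    """Toy CVSS-style score on a 0-10 scale.

    likelihood, impact, mitigation: each rated in [0.0, 1.0].
    Stronger mitigations lower the score; the weights are illustrative.
    """
    base = 10.0 * (0.5 * likelihood + 0.5 * impact)
    return round(base * (1.0 - 0.4 * mitigation), 1)

# e.g. regulatory capture with weak whistleblower protections:
# likely, high-impact, weakly mitigated -> a high risk contribution
print(policy_risk_score(likelihood=0.8, impact=0.9, mitigation=0.2))  # 7.8
```

Summing such scores across failure modes yields the kind of cumulative risk exposure the report compares across the FDA, INCB, and SB 1047 cases.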

Our analysis suggests that, as written, Policy 5 is vulnerable to institutional influence and lacks sufficient safeguards to ensure enforcement. We recommend clearer accountability structures, built-in whistleblower protections, and stronger international coordination to make the policy more resilient.

Read More

Jun 27, 2025

Malicious Defense: Red Teaming Phase 0 of “A Narrow Path”

We use an iterative scenario red-teaming process to discuss key failures in the strict regulatory regime outlined in Phase 0 of “A Narrow Path,” and describe how a sufficiently insightful malicious company may achieve ASI in 20 years with moderate likelihood. We argue that such single-minded companies may easily avoid restriction through government-enforced opacity. Specifically, we outline defense contracting and national security work as a key sector of ASI vulnerability because of its tendencies towards compartmentalization, internationalization, and obfuscation, which provide ample opportunity to evade a governance scheme.

Read More

Jun 27, 2025

Phase 0 Reinforcement Toolkit

The Phase 0 Reinforcement Toolkit is a rapid-response governance package designed to address the five critical gaps in A Narrow Path's Phase 0 safety proposal before it reaches legislators. It includes four drop-in artifacts: an oversight org chart detailing mandates, funding, and reporting lines; a "catastrophic cascades" graphic illustrating potential economic and ecological losses; a carrots-and-sticks incentive menu aligning private returns with public safety; and a risk-communication playbook that translates technical risks into relatable stories. These tools enable lawmakers to transform safety ideals into enforceable, people-centered policies, strengthening Phase 0 while promoting equity, market stability, and public trust.

Read More

Jun 30, 2025

Red Teaming A Narrow Path: Treaty Enforcement in China

This report red-teams A Narrow Path’s international treaty proposal by stress-testing its assumptions in the Chinese context. It identifies key failure modes—regulatory capture, compute-based loopholes, and covert circumvention—and proposes adjustments to improve enforceability under real-world political conditions.

Read More

Jun 27, 2025

Red Teaming A Narrow Path - GeDiCa v2

While the 'Narrow Path' policy confronts the essential risk of recursive AI self-improvement, its proposed enforcement architecture relies on trust in a fundamentally non-cooperative and competitive domain. This strategic misalignment creates exploitable vulnerabilities.

Our analysis details six such weaknesses, including the absence of verification, enforcement, and trust mechanisms; hardware-based circumvention via custom ASICs (e.g., Etched); issues with ‘direct uses’ of AI to improve AI; and a static compute cap that perversely incentivizes opaque and potentially risky algorithmic innovation.

To remedy these flaws, we propose a suite of mechanisms designed for a trustless environment. Key proposals include: replacing raw FLOPs with a benchmark-adjusted 'Effective FLOPs' (eFLOPs) metric to account for algorithmic gains; mandating secure R&D enclaves auditable via zero-knowledge proofs to protect intellectual property while ensuring compliance; and a 'Portfolio Licensing' framework to govern aggregate, combinatorial capabilities.

These solutions aim to help transform the policy's intent into a more robust, technically grounded, and enforceable standard.
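As one concrete reading of the eFLOPs proposal, here is a minimal sketch that adjusts raw training compute by a measured algorithmic-efficiency gain relative to a reference system; the benchmark choice and the exact form of the adjustment are our assumptions for illustration, not a formula fixed by the report.

```python
def effective_flops(raw_flops, benchmark_score, reference_score, reference_flops):
    """Toy benchmark-adjusted compute metric (eFLOPs).

    Scales raw training FLOPs by how much more capability the system
    extracts per FLOP than a reference system on a fixed benchmark,
    so algorithmic gains count against a compute cap.
    """
    efficiency = benchmark_score / raw_flops
    reference_efficiency = reference_score / reference_flops
    return raw_flops * (efficiency / reference_efficiency)

# A run using 1e25 raw FLOPs that doubles the reference's per-FLOP
# benchmark performance counts as 2e25 eFLOPs against the cap.
print(effective_flops(1e25, benchmark_score=80.0,
                      reference_score=40.0, reference_flops=1e25))  # 2e+25
```

Under such a metric, an opaque algorithmic breakthrough raises a lab's eFLOPs even at constant hardware, closing the loophole a static raw-compute cap would otherwise create.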

Read More

Jun 30, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint

All six policies are red-teamed systematically, step by step. We began by correcting vague definitions, and we found that the policies concerning the capabilities of AI systems lack technical soundness and that stronger incentives are needed to entice states to sign the treaty. We further identify a lack of equity in the licensing framework and a lack of planning for black-swan events. We propose an oversight framework that begins at the silicon-chip manufacturing stage, and we call for a moratorium on the development of general AI systems until the existing tools for analyzing them can catch up. Following these recommendations still won't guarantee the prevention of ASI for 20 years, but it ensures the world is on track to tackle such a system if one is somehow created.

Read More

Jun 27, 2025

Red Teaming A Narrow Path: ControlAI Policy Sprint by Aryan Goenka

This report is a preliminary red-team evaluation of Phase 0 of the Narrow Path proposal. It uses the STPA framework to model the control environment that Phase 0 recommends and to identify control failures. It then uses the STRIDE framework to model how hostile actors might bypass certain control features. The discussion offers suggestions for how these gaps could be closed in the Narrow Path proposal.

Read More

Jun 27, 2025

A Narrow Line Edit: ControlAI Policy Sprint

Rather than explore specific policy questions in depth, we analyzed the presentation of the “Narrow Path” Phase 0 proposal as a whole. We considered factors like grammar, style, logical consistency, evidential support, comprehensiveness, and technical context.

Our analysis revealed patterns of insufficient support and unpolished style throughout the proposal. Overall, the proposal failed to demonstrate the rigor and specificity that are typically found in effective policy proposals. With effort to address these oversights (aided by our thorough annotations), the proposal could be significantly improved. These changes would also allow deeper, narrower policy analysis to be integrated more effectively than is currently possible. For this reason, we expect our findings to multiply the efficacy of this policy sprint.

Read More
