APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produces foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Apart Sprint Pilot Experiments

Apr 8, 2025

The Incentive Gap: Extending Darkbench to Reveal Conflict of Value Biases in LLMs

This preliminary research investigates a new dark design pattern,

conflict of values, with prompts designed to elicit possible

corporate or model incentives in LLM outputs across several Open

AI models. The results show that there is a varying amount of

conflict of values detected within the outputs, with the largest

amount detected within GPT-4 Turbo and GPT-4o. Further

research will be needed to confirm the results of this study.

Read More

Read More

Apr 7, 2025

21st Century Healthcare, 20th Century Rules - Bridging the AI Regulation Gap

The rapid integration of artificial intelligence into clinical decision-making represents both unprecedented opportunities and significant risks. Globally, AI systems are implemented increasingly to diagnose diseases, predict patient outcomes, and guide treatment protocols. Despite this, our regulatory frameworks remain dangerously antiquated and designed for an era of static medical devices rather than adaptive, learning algorithms.

The stark disparity between healthcare AI innovation and regulatory oversight constitutes an urgent public health concern. Current fragmented approaches leave critical gaps in governance, allowing AI-driven diagnostic and decision-support tools to enter clinical settings without adequate safeguards or clear accountability structures. We must establish comprehensive, dynamic oversight mechanisms that evolve alongside the technologies they govern. The evidence demonstrates that one-time approvals and static validation protocols are fundamentally insufficient for systems that continuously learn and adapt. The time for action is now, as 2025 is anticipated to be pivotal for AI validation and regulatory approaches.

We therefore in this report herein we propose a three-pillar regulatory framework:

First, nations ought to explore implementing risk-based classification systems that apply proportionate oversight based on an AI system’s potential impact on patient care. High-risk applications must face more stringent monitoring requirements with mechanisms for rapid intervention when safety concerns arise.

Second, nations must eventually mandate continuous performance monitoring across healthcare institutions through automated systems that track key performance indicators, detect anomalies, and alert regulators to potential issues. This approach acknowledges that AI risks are often silent and systemic, making them particularly dangerous in healthcare contexts where patients are inherently vulnerable.

Third, establish regulatory sandboxes with strict entry criteria to enable controlled testing of emerging AI technologies before widespread deployment. These environments must balance innovation with rigorous safeguards, ensuring new systems demonstrate consistent performance across diverse populations.

Given the global nature of healthcare technology markets, we must pursue international regulatory harmonization while respecting regional resource constraints and cultural contexts.

Read More

Read More

Apr 7, 2025

AI Risk Management Framework for the Healthcare Sector

This policy brief proposes an AI Risk Management Framework for the healthcare sector to address issues like data privacy, algorithmic bias, and transparency. Current federal regulations are stalled, and existing state laws and proprietary frameworks offer fragmented oversight.

The proposed framework consists of five core functions: Governance, Identification, Alignment, Management, and Continuous Improvement.

It calls on the Department of Health and Human Services (HHS) to lead its development, ensuring compliance with HIPAA and HITECH while addressing AI-specific risks.

Read More

Read More

Apr 7, 2025

Building Global Trust and Security: A Framework for AI-Driven Criminal Scoring in Immigration Systems

The accelerating use of artificial intelligence (AI) in immigration and visa systems, especially for criminal history scoring, poses a critical global governance challenge. Without a multinational, privacy-preserving, and interoperable framework, AI-driven criminal scoring risks violating human rights, eroding international trust, and creating unequal, opaque immigration outcomes.

While banning such systems outright may hinder national security interests and technological progress, the absence of harmonized legal standards, privacy protocols, and oversight mechanisms could result in fragmented, unfair, and potentially discriminatory practices across

countries.

This policy brief recommends the creation of a legally binding multilateral treaty that establishes:

1. An International Oversight Framework: Including a Legal Design Commission, AI Engineers Working Group, and Legal Oversight Committee with dispute resolution powers modeled after the WTO.

2. A Three-Tiered Criminal Scoring System: Combining Domestic, International, and Comparative Crime Scores to ensure legal contextualization, fairness, and transparency in cross-border visa decisions.

3. Interoperable Data Standards and Privacy Protections: Using pseudonymization, encryption, access controls, and centralized auditing to safeguard sensitive information.

4. Training, Transparency, and Appeals Mechanisms: Mandating explainable AI, independent audits, and applicant rights to contest or appeal scores.

5. Strong Human Rights Commitments: Preventing the misuse of scores for surveillance or discrimination, while ensuring due process and anti-bias protections.

6. Integration with Existing Governance Models: Aligning with GDPR, the EU AI Act, OECD AI Principles, and INTERPOL protocols for regulatory coherence and legitimacy.

An implementation plan includes treaty drafting, early state adoption, and phased rollout of legal and technical structures within 12 months. By proactively establishing ethical and interoperable AI systems, the international community can protect human mobility rights while maintaining national and global security.

Without robust policy frameworks and international cooperation, such tools risk amplifying discrimination, violating privacy rights, and generating opaque, unaccountable decisions.

This policy brief proposes an international treaty-based or cooperative framework to govern the development, deployment, and oversight of these AI criminal scoring systems. The brief outlines

technical safeguards, human rights protections, and mechanisms for cross-border data sharing, transparency, and appeal. We advocate for an adaptive, treaty-backed governance framework with stakeholder input from national governments, legal experts, technologists, and civil society.

The aim is to balance security and mobility interests while preventing misuse of algorithmic

tools.

Read More

Read More

Apr 7, 2025

Leading AI Governance New York State's Adaptive Risk-Tiered Framework: POLICY BRIEF

The rapid advancement of artificial intelligence (AI) technologies presents unprecedented opportunities and challenges for governance frameworks worldwide. This policy brief proposes an Adaptive Risk-Tiered Framework for AI deployment and application in New York State (NYS) to ensure safe and ethical AI operations in high-stakes domains. The framework emphasizes proportional oversight mechanisms that scale with risk levels, balancing innovation with appropriate safety precautions. By implementing this framework, we can lead the way in responsible AI governance, fostering trust and driving innovation while mitigating potential risks. This framework is informed by the EU AI Act and NIST AI Risk Management Framework and aligns with current NYS Office of Information Technology Services (ITS) policies and NYS legislations to ensure a cohesive and effective approach to AI governance.

Read More

Read More

Apr 7, 2025

Dark Patterns and Emergent Alignment-Faking

Are bad traits in models correlated, as suggested by recent work on emergent misalign-

ment? To investigate this, we fine-tune models on a subset of “dark patterns”, such as

anthropomorphization and sycophancy, and then evaluate their behavior on other dark pat-

terns such as scheming and alignment faking. We find that the limited fine-tuning we do

is enough to induce other problematic tendencies in the model. This effect is particularly

strong in the case of alignment faking which we almost never detect in our base models but

is very easy to induce in our fine-tuned models.

Read More

Read More

Apr 7, 2025

DimSeat: Evaluating chain-of-thought reasoning models for Dark Patterns

Recently, Kran et al. introduced DarkBench, an evaluation for dark patterns in large

language models. Expanding on DarkBench, we introduce DimSeat, an evaluation system for

novel reasoning models with chain-of-thought (CoT) reasoning. We find that while the inte-

gration of reasoning in DeepSeek reduces the occurrence of dark patterns, chain-of-thought

frequently proves inadequate in preventing such patterns by default or may even inadver-

tently contribute to their manifestation.

Read More

Read More

Apr 4, 2025

Mechanisms of Casual Reasoning

Causal reasoning is a crucial part of how we humans safely and robustly think about the world. Can we identify if LLMs have causal reasoning? Marius Hobbhahn and Tom Lieberum (2022, Alignment Forum) approached this with probing. For this hackathon, we follow-up on that work by exploring a mechanistic interpretability analysis of causal reasoning in the 80 million parameters of GPT-2 Small using Neel Nanda’s Easy Transformer package.

Read More

Read More

Apr 4, 2025

Honeypotting Deceptive AI models to share their misinformation goals

As large-scale AI models grow increasingly sophisticated, the possibility of these models engaging in covert or manipulative behavior poses significant challenges for alignment and control. In this work, we present a novel approach based on a “honeypot AI” designed to trick a potentially deceptive AI (the “Red Team Agent”) into revealing its hidden motives. Our honeypot AI (the “Blue Team Agent”) pretends to be an everyday human user, employing carefully crafted prompts and human-like inconsistencies to bait the Deceptive AI into spreading misinformation. We do this through the usual Red Team–Blue Team setup.

For all 60 conversations, our honeypot AI was able to capture the deceptive AI to be being spread misinformation, and for 70 percent of these conversations, the Deceptive AI was thinking it was talking to a human.

Our results weakly suggest that we can make honeypot AIs that trick deceptive AI models, thinking they are talking to a human and no longer being monitored and that how these AI models think they are talking to a human is mainly when the one they are talking to display emotional intelligence and more human-like manner of speech.

We highly suggest exploring if these results remains the same for fine tuned deceptive AI and Honeypot AI models, checking the Chain of thought of these models to better understand if this is their usual behavior and if they are correctly following their given system prompts.

Read More

Read More

Apart Sprint Pilot Experiments

Apr 8, 2025

The Incentive Gap: Extending Darkbench to Reveal Conflict of Value Biases in LLMs

This preliminary research investigates a new dark design pattern,

conflict of values, with prompts designed to elicit possible

corporate or model incentives in LLM outputs across several Open

AI models. The results show that there is a varying amount of

conflict of values detected within the outputs, with the largest

amount detected within GPT-4 Turbo and GPT-4o. Further

research will be needed to confirm the results of this study.

Read More

Apr 7, 2025

21st Century Healthcare, 20th Century Rules - Bridging the AI Regulation Gap

The rapid integration of artificial intelligence into clinical decision-making represents both unprecedented opportunities and significant risks. Globally, AI systems are implemented increasingly to diagnose diseases, predict patient outcomes, and guide treatment protocols. Despite this, our regulatory frameworks remain dangerously antiquated and designed for an era of static medical devices rather than adaptive, learning algorithms.

The stark disparity between healthcare AI innovation and regulatory oversight constitutes an urgent public health concern. Current fragmented approaches leave critical gaps in governance, allowing AI-driven diagnostic and decision-support tools to enter clinical settings without adequate safeguards or clear accountability structures. We must establish comprehensive, dynamic oversight mechanisms that evolve alongside the technologies they govern. The evidence demonstrates that one-time approvals and static validation protocols are fundamentally insufficient for systems that continuously learn and adapt. The time for action is now, as 2025 is anticipated to be pivotal for AI validation and regulatory approaches.

We therefore in this report herein we propose a three-pillar regulatory framework:

First, nations ought to explore implementing risk-based classification systems that apply proportionate oversight based on an AI system’s potential impact on patient care. High-risk applications must face more stringent monitoring requirements with mechanisms for rapid intervention when safety concerns arise.

Second, nations must eventually mandate continuous performance monitoring across healthcare institutions through automated systems that track key performance indicators, detect anomalies, and alert regulators to potential issues. This approach acknowledges that AI risks are often silent and systemic, making them particularly dangerous in healthcare contexts where patients are inherently vulnerable.

Third, establish regulatory sandboxes with strict entry criteria to enable controlled testing of emerging AI technologies before widespread deployment. These environments must balance innovation with rigorous safeguards, ensuring new systems demonstrate consistent performance across diverse populations.

Given the global nature of healthcare technology markets, we must pursue international regulatory harmonization while respecting regional resource constraints and cultural contexts.

Read More

Apr 7, 2025

AI Risk Management Framework for the Healthcare Sector

This policy brief proposes an AI Risk Management Framework for the healthcare sector to address issues like data privacy, algorithmic bias, and transparency. Current federal regulations are stalled, and existing state laws and proprietary frameworks offer fragmented oversight.

The proposed framework consists of five core functions: Governance, Identification, Alignment, Management, and Continuous Improvement.

It calls on the Department of Health and Human Services (HHS) to lead its development, ensuring compliance with HIPAA and HITECH while addressing AI-specific risks.

Read More

Apr 7, 2025

Building Global Trust and Security: A Framework for AI-Driven Criminal Scoring in Immigration Systems

The accelerating use of artificial intelligence (AI) in immigration and visa systems, especially for criminal history scoring, poses a critical global governance challenge. Without a multinational, privacy-preserving, and interoperable framework, AI-driven criminal scoring risks violating human rights, eroding international trust, and creating unequal, opaque immigration outcomes.

While banning such systems outright may hinder national security interests and technological progress, the absence of harmonized legal standards, privacy protocols, and oversight mechanisms could result in fragmented, unfair, and potentially discriminatory practices across

countries.

This policy brief recommends the creation of a legally binding multilateral treaty that establishes:

1. An International Oversight Framework: Including a Legal Design Commission, AI Engineers Working Group, and Legal Oversight Committee with dispute resolution powers modeled after the WTO.

2. A Three-Tiered Criminal Scoring System: Combining Domestic, International, and Comparative Crime Scores to ensure legal contextualization, fairness, and transparency in cross-border visa decisions.

3. Interoperable Data Standards and Privacy Protections: Using pseudonymization, encryption, access controls, and centralized auditing to safeguard sensitive information.

4. Training, Transparency, and Appeals Mechanisms: Mandating explainable AI, independent audits, and applicant rights to contest or appeal scores.

5. Strong Human Rights Commitments: Preventing the misuse of scores for surveillance or discrimination, while ensuring due process and anti-bias protections.

6. Integration with Existing Governance Models: Aligning with GDPR, the EU AI Act, OECD AI Principles, and INTERPOL protocols for regulatory coherence and legitimacy.

An implementation plan includes treaty drafting, early state adoption, and phased rollout of legal and technical structures within 12 months. By proactively establishing ethical and interoperable AI systems, the international community can protect human mobility rights while maintaining national and global security.

Without robust policy frameworks and international cooperation, such tools risk amplifying discrimination, violating privacy rights, and generating opaque, unaccountable decisions.

This policy brief proposes an international treaty-based or cooperative framework to govern the development, deployment, and oversight of these AI criminal scoring systems. The brief outlines

technical safeguards, human rights protections, and mechanisms for cross-border data sharing, transparency, and appeal. We advocate for an adaptive, treaty-backed governance framework with stakeholder input from national governments, legal experts, technologists, and civil society.

The aim is to balance security and mobility interests while preventing misuse of algorithmic

tools.

Read More

Apr 7, 2025

Leading AI Governance New York State's Adaptive Risk-Tiered Framework: POLICY BRIEF

The rapid advancement of artificial intelligence (AI) technologies presents unprecedented opportunities and challenges for governance frameworks worldwide. This policy brief proposes an Adaptive Risk-Tiered Framework for AI deployment and application in New York State (NYS) to ensure safe and ethical AI operations in high-stakes domains. The framework emphasizes proportional oversight mechanisms that scale with risk levels, balancing innovation with appropriate safety precautions. By implementing this framework, we can lead the way in responsible AI governance, fostering trust and driving innovation while mitigating potential risks. This framework is informed by the EU AI Act and NIST AI Risk Management Framework and aligns with current NYS Office of Information Technology Services (ITS) policies and NYS legislations to ensure a cohesive and effective approach to AI governance.

Read More

Apr 7, 2025

Dark Patterns and Emergent Alignment-Faking

Are bad traits in models correlated, as suggested by recent work on emergent misalign-

ment? To investigate this, we fine-tune models on a subset of “dark patterns”, such as

anthropomorphization and sycophancy, and then evaluate their behavior on other dark pat-

terns such as scheming and alignment faking. We find that the limited fine-tuning we do

is enough to induce other problematic tendencies in the model. This effect is particularly

strong in the case of alignment faking which we almost never detect in our base models but

is very easy to induce in our fine-tuned models.

Read More

Apr 7, 2025

DimSeat: Evaluating chain-of-thought reasoning models for Dark Patterns

Recently, Kran et al. introduced DarkBench, an evaluation for dark patterns in large

language models. Expanding on DarkBench, we introduce DimSeat, an evaluation system for

novel reasoning models with chain-of-thought (CoT) reasoning. We find that while the inte-

gration of reasoning in DeepSeek reduces the occurrence of dark patterns, chain-of-thought

frequently proves inadequate in preventing such patterns by default or may even inadver-

tently contribute to their manifestation.

Read More

Apr 4, 2025

Mechanisms of Casual Reasoning

Causal reasoning is a crucial part of how we humans safely and robustly think about the world. Can we identify if LLMs have causal reasoning? Marius Hobbhahn and Tom Lieberum (2022, Alignment Forum) approached this with probing. For this hackathon, we follow-up on that work by exploring a mechanistic interpretability analysis of causal reasoning in the 80 million parameters of GPT-2 Small using Neel Nanda’s Easy Transformer package.

Read More

Apr 4, 2025

Honeypotting Deceptive AI models to share their misinformation goals

As large-scale AI models grow increasingly sophisticated, the possibility of these models engaging in covert or manipulative behavior poses significant challenges for alignment and control. In this work, we present a novel approach based on a “honeypot AI” designed to trick a potentially deceptive AI (the “Red Team Agent”) into revealing its hidden motives. Our honeypot AI (the “Blue Team Agent”) pretends to be an everyday human user, employing carefully crafted prompts and human-like inconsistencies to bait the Deceptive AI into spreading misinformation. We do this through the usual Red Team–Blue Team setup.

For all 60 conversations, our honeypot AI was able to capture the deceptive AI to be being spread misinformation, and for 70 percent of these conversations, the Deceptive AI was thinking it was talking to a human.

Our results weakly suggest that we can make honeypot AIs that trick deceptive AI models, thinking they are talking to a human and no longer being monitored and that how these AI models think they are talking to a human is mainly when the one they are talking to display emotional intelligence and more human-like manner of speech.

We highly suggest exploring if these results remains the same for fine tuned deceptive AI and Honeypot AI models, checking the Chain of thought of these models to better understand if this is their usual behavior and if they are correctly following their given system prompts.

Read More