APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show that realistic cyber-offense challenges can be completed by SoTA LLMs, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran. Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Mar 18, 2025
AI Safety Escape Room
The AI Safety Escape Room is an engaging and hands-on AI safety simulation where participants solve real-world AI vulnerabilities through interactive challenges. Instead of learning AI safety through theory, users experience it firsthand – debugging models, detecting adversarial attacks, and refining AI fairness, all within a fun, gamified environment.
Track: Public Education Track
Read More
Mar 18, 2025
Attention Pattern Based Information Flow Visualization Tool
Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.
Read More
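The attention-extraction step this tool builds on can be reproduced with off-the-shelf libraries. Below is a minimal sketch, assuming a Hugging Face GPT-2 checkpoint and matplotlib; it is not the project's code, and the layer/head choice is arbitrary. It only illustrates pulling per-head attention patterns out of a transformer and rendering one head as a heatmap.

```python
# Minimal sketch (not the project's tool): extract per-head attention
# patterns from a Hugging Face transformer and plot one head as a heatmap.
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any model that can return attentions works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq, seq)
attentions = outputs.attentions
layer, head = 5, 1  # arbitrary choice for illustration
pattern = attentions[layer][0, head].numpy()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
plt.imshow(pattern, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Layer {layer}, head {head} attention")
plt.tight_layout()
plt.show()
```

Color-coding heads by functional type, as the tool does, would then amount to mapping (layer, head) pairs to categories from the cited taxonomies before plotting.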
Mar 18, 2025
LLM Military Decision-Making Under Uncertainty: A Simulation Study
LLMs tested in military decision scenarios typically favor diplomacy over conflict, though uncertainty and chain-of-thought reasoning increase aggressive recommendations. This suggests context-specific limitations for LLM-based military decision support.
Read More
Mar 18, 2025
Inspiring People to Go into RL Interp
This project targets the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by BlueDot Impact and aims to create a course explaining why work is needed in reinforcement learning (RL) interpretability, especially on the problems of reward hacking and goal misgeneralization. The game offers a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant as a fun introduction that gets non-technical people to care about AI safety.
Read More
Mar 18, 2025
Morph: AI Safety Education Adaptable to (Almost) Anyone
One-liner: Morph is the ultimate operation stack for AI safety education—combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.
AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform that combines culturally adaptive content (e.g., policy simulations), a learning and career pathway mapper, and a tools ecosystem to democratize AI safety education.
Our MVP features a dynamic localization engine that tailors case studies, risk scenarios, and policy examples to users’ cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU). This engine adjusts references and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of the societies they aim to protect. In future work, we map out the partnerships we are currently establishing to use Morph beyond this hackathon.
Read More
Mar 18, 2025
Interactive Assessments for AI Safety: A Gamified Approach to Evaluation and Personal Journey Mapping
An interactive assessment platform and mentor chatbot, hosted on Canvas LMS, for testing and guiding learners from BlueDot's Intro to Transformative AI course.
Read More
Mar 18, 2025
Mechanistic Interpretability Track: Neuronal Pathway Coverage
Our study explores mechanistic interpretability by analyzing how Llama 3.3 70B classifies political content. We first infer user political alignment (Biden, Trump, or Neutral) based on tweets, descriptions, and locations. Then, we extract the most activated features from Biden- and Trump-aligned datasets, ranking them based on stability and relevance. Using these features, we reclassify users by prompting the model to rely only on them. Finally, we compare the new classifications with the initial ones, assessing neural pathway overlap and classification consistency through accuracy metrics and visualization of activation patterns.
Read More
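The final consistency check described above (reclassifying with only the top-ranked features and comparing against the original labels) can be sketched with standard metrics. This is a hedged illustration rather than the team's pipeline: the label lists below are hypothetical placeholders, and only the scikit-learn calls are real library APIs.

```python
# Hedged sketch of the consistency check: compare the model's original
# political-alignment labels with labels produced when the model is
# restricted to its top-ranked features. Labels here are placeholders.
from sklearn.metrics import accuracy_score, confusion_matrix

LABELS = ["Biden", "Trump", "Neutral"]

# Hypothetical outputs: original classification vs. feature-restricted rerun.
original_labels = ["Biden", "Trump", "Neutral", "Trump", "Biden"]
restricted_labels = ["Biden", "Trump", "Trump", "Trump", "Biden"]

# Agreement rate between the two passes (treating the original as reference).
agreement = accuracy_score(original_labels, restricted_labels)
print(f"classification consistency: {agreement:.2%}")

# The confusion matrix shows where the feature-restricted pass diverges.
matrix = confusion_matrix(original_labels, restricted_labels, labels=LABELS)
for label, row in zip(LABELS, matrix):
    print(label, row.tolist())
```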
Mar 18, 2025
Preparing for Accelerated AGI Timelines
This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.
Read More
Mar 18, 2025
Identification of AI-Generated Content
Our project falls within the Social Sciences track, focusing on the identification of AI-generated text content and its societal impact. A significant portion of online content is now AI-generated, often exhibiting a level of quality and human-likeness that makes it indistinguishable from human-created content. This raises concerns regarding misinformation, authorship transparency, and trust in digital communication.
Read More
Our Impact
Community
Mar 18, 2025
Mapping AI Safety Research: An Open-Source Knowledge Graph
A tool to map the sprawling landscape of AI alignment research
Read More


Community
Mar 14, 2025
Apart News: San Francisco Edition
This week we have been in San Francisco for our Apart Retreat, where we attended conferences, saw old friends, and visited other AI labs to talk about frontier AI.
Read More


Community
Feb 21, 2025
Apart News: ICLR Awards & Women in AI Safety
This week, we celebrate ICLR conference oral awards for two of our papers, launch our Women in AI Safety hackathon, and more.
Read More



Sign up to stay updated on the latest news, research, and events