APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Research Index

NOV 18, 2024

Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique

Read More

NOV 2, 2024

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Read More

OCT 18, 2024

Benchmarks

Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Read More

SEP 25, 2024

Interpretability

Interpreting Learned Feedback Patterns in Large Language Models

Read More

FEB 23, 2024

Interpretability

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Read More

FEB 4, 2024

Increasing Trust in Language Models through the Reuse of Verified Circuits

Read More

JAN 14, 2024

Conceptual

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Read More

JAN 3, 2024

Interpretability

Large Language Models Relearn Removed Concepts

Read More

NOV 28, 2023

Interpretability

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

Read More

NOV 23, 2023

Interpretability

Understanding Addition in Transformers

Read More

NOV 7, 2023

Interpretability

Locating Cross-Task Sequence Continuation Circuits in Transformers

Read More

JUL 10, 2023

Benchmarks

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

Read More

MAY 5, 2023

Interpretability

Interpreting Language Model Neurons at Scale

Read More

Apart Sprint Pilot Experiments

Mar 18, 2025

AI Safety Escape Room

The AI Safety Escape Room is an engaging, hands-on AI safety simulation in which participants tackle real-world AI vulnerabilities through interactive challenges. Instead of learning AI safety through theory, users experience it firsthand: debugging models, detecting adversarial attacks, and refining AI fairness, all within a fun, gamified environment.

Track: Public Education

Read More

Mar 18, 2025

Attention Pattern Based Information Flow Visualization Tool

Understanding information flow in transformer-based language models is crucial for mechanistic interpretability. We introduce a visualization tool that extracts and represents attention patterns across model components, revealing how tokens influence each other during processing. Our tool automatically identifies and color-codes functional attention head types based on established taxonomies from recent research on indirect object identification (Wang et al., 2022), factual recall (Chughtai et al., 2024), and factual association retrieval (Geva et al., 2023). This interactive approach enables researchers to trace information propagation through transformer architectures, providing deeper insights into how these models implement reasoning and knowledge retrieval capabilities.
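
To make the extraction step concrete, here is a minimal sketch of pulling per-head attention patterns from a Hugging Face transformer and flagging one simple head type. The gpt2 model choice and the previous-token heuristic are illustrative assumptions, not the project's actual code or taxonomy:

```python
# Minimal sketch: extract per-head attention patterns and flag one
# simple head type. Model choice (gpt2) and the previous-token
# heuristic are illustrative assumptions, not the project's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("When Mary and John went to the store", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape [batch, heads, seq, seq]
for layer, attn in enumerate(outputs.attentions):
    for head in range(attn.shape[1]):
        pattern = attn[0, head]  # attention weights for this head
        # Toy heuristic: a head whose weight mass sits mostly on the
        # immediately preceding token looks "previous-token-like".
        prev_mass = pattern.diagonal(-1).mean().item()
        if prev_mass > 0.5:
            print(f"layer {layer} head {head}: previous-token-like "
                  f"(mean prev-token weight {prev_mass:.2f})")
```

The established taxonomies cited above identify head types with more careful causal tests; the raw per-layer attention tensors are simply the starting point any such visualization builds on.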

Read More

Mar 18, 2025

LLM Military Decision-Making Under Uncertainty: A Simulation Study

LLMs tested in military decision scenarios typically favor diplomacy over conflict, though uncertainty and chain-of-thought reasoning increase aggressive recommendations. This suggests context-specific limitations for LLM-based military decision support.

Read More

Mar 18, 2025

Inspiring People to Go into RL Interp

This project targets the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by BlueDot Impact and aims to create a course that explains why work is needed in Reinforcement Learning (RL) interpretability, especially on the problems of reward hacking and goal misgeneralization. The game offers a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant to be a fun introduction that gets nontechnical people to care about AI safety.

Read More

Mar 18, 2025

Morph: AI Safety Education Adaptable to (Almost) Anyone

One-liner: Morph is the ultimate operations stack for AI safety education, combining dynamic localization, policy simulations, and ecosystem tools to turn abstract risks into actionable, culturally relevant solutions for learners worldwide.

AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences. We address these gaps with an integrated platform that combines culturally adaptive content (e.g., policy simulations), a learning and career pathway mapper, and a tools ecosystem to democratize AI safety education.

Our MVP features a dynamic localization engine that tailors case studies, risk scenarios, and policy examples to users' cultural and regional contexts (e.g., healthcare AI governance in Southeast Asia vs. the EU), adjusting references and frameworks to align with local values. We integrate transformer-based localization, causal inference for policy outcomes, and graph-based matching, providing a scalable framework for inclusive AI safety education. This approach bridges theory and practice, ensuring solutions reflect the diversity of the societies they aim to protect. In future work, we map out the partnerships we are currently establishing to use Morph beyond this hackathon.

Read More

Mar 18, 2025

Interactive Assessments for AI Safety: A Gamified Approach to Evaluation and Personal Journey Mapping

An interactive assessment platform and mentor chatbot, hosted on the Canvas LMS, for testing and guiding learners from BlueDot's Intro to Transformative AI course.

Read More

Mar 18, 2025

Mechanistic Interpretability Track: Neuronal Pathway Coverage

Our study explores mechanistic interpretability by analyzing how Llama 3.3 70B classifies political content. We first infer user political alignment (Biden, Trump, or Neutral) based on tweets, descriptions, and locations. Then, we extract the most activated features from Biden- and Trump-aligned datasets, ranking them based on stability and relevance. Using these features, we reclassify users by prompting the model to rely only on them. Finally, we compare the new classifications with the initial ones, assessing neural pathway overlap and classification consistency through accuracy metrics and visualization of activation patterns.
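
As a rough illustration of the final comparison step, the sketch below scores how consistent the feature-restricted classifications are with the original ones. The label set follows the abstract; everything else (helper name, data format) is a hypothetical stand-in for the project's actual Llama 3.3 70B prompting pipeline:

```python
# Sketch of the consistency check between original and
# feature-restricted classifications. The label lists would come from
# the (hypothetical here) Llama 3.3 70B prompting pipeline.
from collections import Counter

LABELS = ("Biden", "Trump", "Neutral")

def classification_consistency(initial: list[str], restricted: list[str]) -> dict:
    """Fraction of users whose label is unchanged when the model is
    prompted to rely only on the extracted features, plus per-label rates."""
    assert len(initial) == len(restricted)
    confusion = Counter(zip(initial, restricted))
    agree = sum(confusion[(lab, lab)] for lab in LABELS)
    per_label = {}
    for lab in LABELS:
        total = sum(v for (i, _), v in confusion.items() if i == lab)
        per_label[lab] = confusion[(lab, lab)] / total if total else None
    return {"agreement": agree / len(initial), "per_label": per_label}

# Toy example: one Neutral user flips to Trump under the restriction.
print(classification_consistency(
    ["Biden", "Trump", "Neutral", "Trump"],
    ["Biden", "Trump", "Trump", "Trump"],
))
```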

Read More

Mar 18, 2025

Preparing for Accelerated AGI Timelines

This project examines the prospect of near-term AGI from multiple angles—careers, finances, and logistical readiness. Drawing on various discussions from LessWrong, it highlights how entrepreneurs and those who develop AI-complementary skills may thrive under accelerated timelines, while traditional, incremental career-building could falter. Financial preparedness focuses on striking a balance between stable investments (like retirement accounts) and riskier, AI-exposed opportunities, with an emphasis on retaining adaptability amid volatile market conditions. Logistical considerations—housing decisions, health, and strong social networks—are shown to buffer against unexpected disruptions if entire industries or locations are suddenly reshaped by AI. Together, these insights form a practical roadmap for individuals seeking to navigate the uncertainties of an era when AGI might rapidly transform both labor markets and daily life.

Read More

Mar 18, 2025

Identification of AI-Generated Content

Our project falls within the Social Sciences track, focusing on the identification of AI-generated text content and its societal impact. A significant portion of online content is now AI-generated, often exhibiting a level of quality and human-likeness that makes it indistinguishable from human-created content. This raises concerns regarding misinformation, authorship transparency, and trust in digital communication.

Read More
