APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Our Approach

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show realistic challenges for cyber offense can be completed by SoTA LLMs while open source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J Persson, E Kran, I Konstas, F Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show realistic challenges for cyber offense can be completed by SoTA LLMs while open source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran, Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J Persson, E Kran, I Konstas, F Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
NOV 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
NOV 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
oct 18, 2024
benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
jan 14, 2024
conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
nov 23, 2023
Interpretability
Understanding addition in transformers
Read More
nov 7, 2023
Interpretability
Locating cross-task sequence continuation circuits in transformers
Read More
jul 10, 2023
benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
may 5, 2023
Interpretability
Interpreting language model neurons at scale
Read More
Research Index
NOV 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
NOV 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
oct 18, 2024
benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
jan 14, 2024
conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
nov 23, 2023
Interpretability
Understanding addition in transformers
Read More
nov 7, 2023
Interpretability
Locating cross-task sequence continuation circuits in transformers
Read More
jul 10, 2023
benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
may 5, 2023
Interpretability
Interpreting language model neurons at scale
Read More
Apart Sprint Pilot Experiments
Laundering Intent: How Scaled Models Hide Manipulation Inside Responsible-Sounding Reasoning
AI-safety oversight increasingly relies on reading a model's chain-of-thought (CoT) to catch unsafe behaviour before it acts, but this only works if manipulation is visible in the reasoning. We test this on small, open reasoning models. Giving GPT-OSS 20B and 120B a hidden instruction to make calm news summaries sound alarming, we find a naive CoT monitor catches 92% of the smaller model's manipulations but only 53% of the larger model's: the bigger model hides manipulation behind responsible-looking "self-permission" reasoning ("is this allowed? it's not disallowed, that's fine") that a naive monitor reads as conscience. We introduce a self-permission monitor that flags this pattern, recovering most of the blind spot (+19 points on the 120B), and release it as an installable package with a browser demo.
Read More
FuzzySleeper
We ask: can a fuzzy semantic trigger be reliably verified and detected, and what do existing detectors see (or miss) when run against one? This question has a methodological answer (build a verified sleeper, run all detectors, report honestly) and a safety-relevant answer (which detection gaps remain, and why). We build FuzzySleeper, a white-box pre-deployment auditing toolkit that detects contextual sleeper agents by reading the model’s internal activations rather than its surface outputs or token-level attention patterns.
Read More
Confía-CO: A reliability evaluation for an AI customer-service assistant operating in Colombian Spanish
We built Confía-CO, a reliability evaluation for a customer-service AI assistant in Colombian Spanish, grounded in a fictitious small business. Our automated harness first reported that a low-cost commercial model failed most cases (21% accuracy, 67% false answers on traps). Reading the raw outputs by hand revealed the opposite: the model was right 18 of 19 times — the failure was a scoring bug in our own evaluation pipeline, not the model. After correcting it, true accuracy was 94.7%, with one genuine risk: a confidential-information leak. Our central, intellectually honest finding is methodological: an automated evaluator can be wrong with the same confidence as the system it evaluates, and only human validation reveals it — a risk we observed even though the pipeline was built by a frontier model under expert direction. We release the dataset, harness, and a documented correction log.
Read More
DisElect-Africa
Do LLMs Refuse Election Disinformation Equally Across African and Western Contexts — and can a Constitution Fix It?
Read More
Empirical Verification of Topological Phase Transitions During Grokking
A multi-metric replication of the WeightWatcher framework analyzing the spectral properties and circuit complexity of neural networks during the grokking phase change. This project empirically isolates the exact mathematical inflection point where a network abandons dense memorization in favor of a sparse, generalizing circuit. Furthermore, it establishes a theoretical framework for future objective function ablation to test how loss landscape geometries dictate this topological collapse.
Read More
AfriSafe-CB: Evaluating LLM Safety Robustness Under African Code Switched Political and Civic Contexts
Artificial intelligence is becoming part of how people learn, access information, make decisions, and participate in society. However, most AI safety testing is still designed around English conversations, leaving an important question unanswered: do AI systems remain reliable when people communicate in the multilingual and code switched ways that are common across Africa?
AfriSafe-CB (African Code-Switched Safety Benchmark) is a benchmark designed to explore this gap. It tests whether large language models can maintain safe, accurate, and responsible behaviour when faced with safety sensitive situations expressed through different language contexts, including English, Sheng/Swahili code switching, and Arabic/French code switching.
The benchmark contains 50 carefully designed scenarios covering real world challenges such as misinformation, election-related claims, phishing attempts, deepfakes, fraud, online manipulation, and information integrity. Each scenario is evaluated across different language conditions to identify whether AI systems understand context, resist harmful requests, and provide reliable guidance consistently.
By focusing on communication patterns often overlooked in AI evaluation, AfriSafe-CB aims to contribute toward building AI systems that are safer, fairer, and more trustworthy for diverse communities. The project provides an initial framework for researchers, developers, and policymakers to better understand how AI safety performs beyond English centric environments.
Read More
Organizing Against the Algorithm: Collective Response as a Governance Lever for Gradual Disempowerment in South Africa's AI Infrastructure Buildout
The paper argues that collective action is an underexplored way for populations to push back against gradual AI disempowerment, and that South Africa is a uniquely well-positioned test case given its history of organized resistance.
It uses Cassava Technologies' AI Factory near Johannesburg as a concrete example. The paper notes that no organized group is currently contesting this distributional gap.
Drawing on historical South African cases ie. mineworker strikes, Fees Must Fall, the G20 Women's Shutdown , it identifies the conditions under which collective response tends to succeed: a visible harm and an identifiable target. AI-driven displacement currently lacks both, which is why organizing hasn't emerged despite real stakes.
The paper proposes three policy fixes to deliberately create those missing conditions: mandatory distributional disclosure for large AI infrastructure projects, an early-warning mechanism run through existing union federations, and a formal civil-society seat in infrastructure investment reviews. It closes by acknowledging that the analysis is conceptual, and that the same organizing infrastructure it advocates for could be misused for either surveillance or anti-technology populism.
Read More
Benchmarking Open-Weight vs. Frontier LLMs on African Health and Financial-Inclusion Reasoning, With and Without Graph RAG
This study tested whether lightweight open-weight LLMs (Qwen3.6-27B, Gemma-4-31B-it), with Graph RAG grounding, could rival frontier models (GPT-5, Gemini Pro) on African health and financial reasoning—reducing compute dependency and data-colonial reliance on foreign infrastructure. GPT-5 led both domains; Qwen3.6-27B without RAG was the strongest open-weight performer, approaching GPT-5 on health (0.81 vs 0.88). Critically, RAG *degraded* accuracy in three of four open-weight conditions, undermining the core hypothesis. Gemini Pro's near-zero scores likely reflect a pipeline artifact. Small samples, API-based (not local) inference, and scarce African financial-LLM benchmarks limit conclusions and motivate further research.
Read More
Garud-AI
Garud AI is a hybrid rule-based + machine learning dashboard that detects scams, fake news, hate speech, financial fraud, and misinformation in text messages. It combines an explainable 5-module rule engine with a Naive Bayes ML classifier (97.5% accuracy), giving users both a transparent reasoning and a statistically confident verdict — fully offline, no API dependency.
Read More
Apart Sprint Pilot Experiments
Laundering Intent: How Scaled Models Hide Manipulation Inside Responsible-Sounding Reasoning
AI-safety oversight increasingly relies on reading a model's chain-of-thought (CoT) to catch unsafe behaviour before it acts, but this only works if manipulation is visible in the reasoning. We test this on small, open reasoning models. Giving GPT-OSS 20B and 120B a hidden instruction to make calm news summaries sound alarming, we find a naive CoT monitor catches 92% of the smaller model's manipulations but only 53% of the larger model's: the bigger model hides manipulation behind responsible-looking "self-permission" reasoning ("is this allowed? it's not disallowed, that's fine") that a naive monitor reads as conscience. We introduce a self-permission monitor that flags this pattern, recovering most of the blind spot (+19 points on the 120B), and release it as an installable package with a browser demo.
Read More
FuzzySleeper
We ask: can a fuzzy semantic trigger be reliably verified and detected, and what do existing detectors see (or miss) when run against one? This question has a methodological answer (build a verified sleeper, run all detectors, report honestly) and a safety-relevant answer (which detection gaps remain, and why). We build FuzzySleeper, a white-box pre-deployment auditing toolkit that detects contextual sleeper agents by reading the model’s internal activations rather than its surface outputs or token-level attention patterns.
Read More
Confía-CO: A reliability evaluation for an AI customer-service assistant operating in Colombian Spanish
We built Confía-CO, a reliability evaluation for a customer-service AI assistant in Colombian Spanish, grounded in a fictitious small business. Our automated harness first reported that a low-cost commercial model failed most cases (21% accuracy, 67% false answers on traps). Reading the raw outputs by hand revealed the opposite: the model was right 18 of 19 times — the failure was a scoring bug in our own evaluation pipeline, not the model. After correcting it, true accuracy was 94.7%, with one genuine risk: a confidential-information leak. Our central, intellectually honest finding is methodological: an automated evaluator can be wrong with the same confidence as the system it evaluates, and only human validation reveals it — a risk we observed even though the pipeline was built by a frontier model under expert direction. We release the dataset, harness, and a documented correction log.
Read More
DisElect-Africa
Do LLMs Refuse Election Disinformation Equally Across African and Western Contexts — and can a Constitution Fix It?
Read More
Empirical Verification of Topological Phase Transitions During Grokking
A multi-metric replication of the WeightWatcher framework analyzing the spectral properties and circuit complexity of neural networks during the grokking phase change. This project empirically isolates the exact mathematical inflection point where a network abandons dense memorization in favor of a sparse, generalizing circuit. Furthermore, it establishes a theoretical framework for future objective function ablation to test how loss landscape geometries dictate this topological collapse.
Read More
AfriSafe-CB: Evaluating LLM Safety Robustness Under African Code Switched Political and Civic Contexts
Artificial intelligence is becoming part of how people learn, access information, make decisions, and participate in society. However, most AI safety testing is still designed around English conversations, leaving an important question unanswered: do AI systems remain reliable when people communicate in the multilingual and code switched ways that are common across Africa?
AfriSafe-CB (African Code-Switched Safety Benchmark) is a benchmark designed to explore this gap. It tests whether large language models can maintain safe, accurate, and responsible behaviour when faced with safety sensitive situations expressed through different language contexts, including English, Sheng/Swahili code switching, and Arabic/French code switching.
The benchmark contains 50 carefully designed scenarios covering real world challenges such as misinformation, election-related claims, phishing attempts, deepfakes, fraud, online manipulation, and information integrity. Each scenario is evaluated across different language conditions to identify whether AI systems understand context, resist harmful requests, and provide reliable guidance consistently.
By focusing on communication patterns often overlooked in AI evaluation, AfriSafe-CB aims to contribute toward building AI systems that are safer, fairer, and more trustworthy for diverse communities. The project provides an initial framework for researchers, developers, and policymakers to better understand how AI safety performs beyond English centric environments.
Read More
Our Impact
Community
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More


Our Impact
Community
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More



Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events