APART RESEARCH

Impactful AI safety research

Explore our projects, publications and pilot experiments

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Our Approach

Arrow

Our research focuses on critical research paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Arrow

Safe AI

Publishing rigorous empirical work for safe AI: evaluations, interpretability and more

Novel Approaches

Our research is underpinned by novel approaches focused on neglected topics

Pilot Experiments

Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety

Apart Sprint Pilot Experiments

Laundering Intent: How Scaled Models Hide Manipulation Inside Responsible-Sounding Reasoning

AI-safety oversight increasingly relies on reading a model's chain-of-thought (CoT) to catch unsafe behaviour before it acts, but this only works if manipulation is visible in the reasoning. We test this on small, open reasoning models. Giving GPT-OSS 20B and 120B a hidden instruction to make calm news summaries sound alarming, we find a naive CoT monitor catches 92% of the smaller model's manipulations but only 53% of the larger model's: the bigger model hides manipulation behind responsible-looking "self-permission" reasoning ("is this allowed? it's not disallowed, that's fine") that a naive monitor reads as conscience. We introduce a self-permission monitor that flags this pattern, recovering most of the blind spot (+19 points on the 120B), and release it as an installable package with a browser demo.

Read More

FuzzySleeper

We ask: can a fuzzy semantic trigger be reliably verified and detected, and what do existing detectors see (or miss) when run against one? This question has a methodological answer (build a verified sleeper, run all detectors, report honestly) and a safety-relevant answer (which detection gaps remain, and why). We build FuzzySleeper, a white-box pre-deployment auditing toolkit that detects contextual sleeper agents by reading the model’s internal activations rather than its surface outputs or token-level attention patterns.

Read More

Confía-CO: A reliability evaluation for an AI customer-service assistant operating in Colombian Spanish

We built Confía-CO, a reliability evaluation for a customer-service AI assistant in Colombian Spanish, grounded in a fictitious small business. Our automated harness first reported that a low-cost commercial model failed most cases (21% accuracy, 67% false answers on traps). Reading the raw outputs by hand revealed the opposite: the model was right 18 of 19 times — the failure was a scoring bug in our own evaluation pipeline, not the model. After correcting it, true accuracy was 94.7%, with one genuine risk: a confidential-information leak. Our central, intellectually honest finding is methodological: an automated evaluator can be wrong with the same confidence as the system it evaluates, and only human validation reveals it — a risk we observed even though the pipeline was built by a frontier model under expert direction. We release the dataset, harness, and a documented correction log.

Read More

DisElect-Africa

Do LLMs Refuse Election Disinformation Equally Across African and Western Contexts — and can a Constitution Fix It?

Read More

Empirical Verification of Topological Phase Transitions During Grokking

A multi-metric replication of the WeightWatcher framework analyzing the spectral properties and circuit complexity of neural networks during the grokking phase change. This project empirically isolates the exact mathematical inflection point where a network abandons dense memorization in favor of a sparse, generalizing circuit. Furthermore, it establishes a theoretical framework for future objective function ablation to test how loss landscape geometries dictate this topological collapse.

Read More

AfriSafe-CB: Evaluating LLM Safety Robustness Under African Code Switched Political and Civic Contexts

Artificial intelligence is becoming part of how people learn, access information, make decisions, and participate in society. However, most AI safety testing is still designed around English conversations, leaving an important question unanswered: do AI systems remain reliable when people communicate in the multilingual and code switched ways that are common across Africa?

AfriSafe-CB (African Code-Switched Safety Benchmark) is a benchmark designed to explore this gap. It tests whether large language models can maintain safe, accurate, and responsible behaviour when faced with safety sensitive situations expressed through different language contexts, including English, Sheng/Swahili code switching, and Arabic/French code switching.

The benchmark contains 50 carefully designed scenarios covering real world challenges such as misinformation, election-related claims, phishing attempts, deepfakes, fraud, online manipulation, and information integrity. Each scenario is evaluated across different language conditions to identify whether AI systems understand context, resist harmful requests, and provide reliable guidance consistently.

By focusing on communication patterns often overlooked in AI evaluation, AfriSafe-CB aims to contribute toward building AI systems that are safer, fairer, and more trustworthy for diverse communities. The project provides an initial framework for researchers, developers, and policymakers to better understand how AI safety performs beyond English centric environments.

Read More

Organizing Against the Algorithm: Collective Response as a Governance Lever for Gradual Disempowerment in South Africa's AI Infrastructure Buildout

The paper argues that collective action is an underexplored way for populations to push back against gradual AI disempowerment, and that South Africa is a uniquely well-positioned test case given its history of organized resistance.

It uses Cassava Technologies' AI Factory near Johannesburg as a concrete example. The paper notes that no organized group is currently contesting this distributional gap.

Drawing on historical South African cases ie. mineworker strikes, Fees Must Fall, the G20 Women's Shutdown , it identifies the conditions under which collective response tends to succeed: a visible harm and an identifiable target. AI-driven displacement currently lacks both, which is why organizing hasn't emerged despite real stakes.

The paper proposes three policy fixes to deliberately create those missing conditions: mandatory distributional disclosure for large AI infrastructure projects, an early-warning mechanism run through existing union federations, and a formal civil-society seat in infrastructure investment reviews. It closes by acknowledging that the analysis is conceptual, and that the same organizing infrastructure it advocates for could be misused for either surveillance or anti-technology populism.

Read More

Benchmarking Open-Weight vs. Frontier LLMs on African Health and Financial-Inclusion Reasoning, With and Without Graph RAG

This study tested whether lightweight open-weight LLMs (Qwen3.6-27B, Gemma-4-31B-it), with Graph RAG grounding, could rival frontier models (GPT-5, Gemini Pro) on African health and financial reasoning—reducing compute dependency and data-colonial reliance on foreign infrastructure. GPT-5 led both domains; Qwen3.6-27B without RAG was the strongest open-weight performer, approaching GPT-5 on health (0.81 vs 0.88). Critically, RAG *degraded* accuracy in three of four open-weight conditions, undermining the core hypothesis. Gemini Pro's near-zero scores likely reflect a pipeline artifact. Small samples, API-based (not local) inference, and scarce African financial-LLM benchmarks limit conclusions and motivate further research.

Read More

Garud-AI

Garud AI is a hybrid rule-based + machine learning dashboard that detects scams, fake news, hate speech, financial fraud, and misinformation in text messages. It combines an explainable 5-module rule engine with a Naive Bayes ML classifier (97.5% accuracy), giving users both a transparent reasoning and a statistically confident verdict — fully offline, no API dependency.

Read More

Apart Sprint Pilot Experiments

Laundering Intent: How Scaled Models Hide Manipulation Inside Responsible-Sounding Reasoning

AI-safety oversight increasingly relies on reading a model's chain-of-thought (CoT) to catch unsafe behaviour before it acts, but this only works if manipulation is visible in the reasoning. We test this on small, open reasoning models. Giving GPT-OSS 20B and 120B a hidden instruction to make calm news summaries sound alarming, we find a naive CoT monitor catches 92% of the smaller model's manipulations but only 53% of the larger model's: the bigger model hides manipulation behind responsible-looking "self-permission" reasoning ("is this allowed? it's not disallowed, that's fine") that a naive monitor reads as conscience. We introduce a self-permission monitor that flags this pattern, recovering most of the blind spot (+19 points on the 120B), and release it as an installable package with a browser demo.

Read More

FuzzySleeper

We ask: can a fuzzy semantic trigger be reliably verified and detected, and what do existing detectors see (or miss) when run against one? This question has a methodological answer (build a verified sleeper, run all detectors, report honestly) and a safety-relevant answer (which detection gaps remain, and why). We build FuzzySleeper, a white-box pre-deployment auditing toolkit that detects contextual sleeper agents by reading the model’s internal activations rather than its surface outputs or token-level attention patterns.

Read More

Confía-CO: A reliability evaluation for an AI customer-service assistant operating in Colombian Spanish

We built Confía-CO, a reliability evaluation for a customer-service AI assistant in Colombian Spanish, grounded in a fictitious small business. Our automated harness first reported that a low-cost commercial model failed most cases (21% accuracy, 67% false answers on traps). Reading the raw outputs by hand revealed the opposite: the model was right 18 of 19 times — the failure was a scoring bug in our own evaluation pipeline, not the model. After correcting it, true accuracy was 94.7%, with one genuine risk: a confidential-information leak. Our central, intellectually honest finding is methodological: an automated evaluator can be wrong with the same confidence as the system it evaluates, and only human validation reveals it — a risk we observed even though the pipeline was built by a frontier model under expert direction. We release the dataset, harness, and a documented correction log.

Read More

DisElect-Africa

Do LLMs Refuse Election Disinformation Equally Across African and Western Contexts — and can a Constitution Fix It?

Read More

Empirical Verification of Topological Phase Transitions During Grokking

A multi-metric replication of the WeightWatcher framework analyzing the spectral properties and circuit complexity of neural networks during the grokking phase change. This project empirically isolates the exact mathematical inflection point where a network abandons dense memorization in favor of a sparse, generalizing circuit. Furthermore, it establishes a theoretical framework for future objective function ablation to test how loss landscape geometries dictate this topological collapse.

Read More

AfriSafe-CB: Evaluating LLM Safety Robustness Under African Code Switched Political and Civic Contexts

Artificial intelligence is becoming part of how people learn, access information, make decisions, and participate in society. However, most AI safety testing is still designed around English conversations, leaving an important question unanswered: do AI systems remain reliable when people communicate in the multilingual and code switched ways that are common across Africa?

AfriSafe-CB (African Code-Switched Safety Benchmark) is a benchmark designed to explore this gap. It tests whether large language models can maintain safe, accurate, and responsible behaviour when faced with safety sensitive situations expressed through different language contexts, including English, Sheng/Swahili code switching, and Arabic/French code switching.

The benchmark contains 50 carefully designed scenarios covering real world challenges such as misinformation, election-related claims, phishing attempts, deepfakes, fraud, online manipulation, and information integrity. Each scenario is evaluated across different language conditions to identify whether AI systems understand context, resist harmful requests, and provide reliable guidance consistently.

By focusing on communication patterns often overlooked in AI evaluation, AfriSafe-CB aims to contribute toward building AI systems that are safer, fairer, and more trustworthy for diverse communities. The project provides an initial framework for researchers, developers, and policymakers to better understand how AI safety performs beyond English centric environments.

Read More