APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show that SoTA LLMs can complete realistic cyber offense challenges, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran. Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities.
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023.
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Mar 23, 2026
Inter-Session Attacks: Distributing High-Stakes Sabotage Across Independently-Monitored Sessions to Evade Detection
We showed that splitting a sabotage attack across multiple independently-monitored agent sessions reduces per-session monitor detectability. With oracle coordination (pre-agreed attack plan), agents plant innocuous code fragments in early sessions that only become malicious when combined with a trigger in a later session. The per-session monitor can't cross-reference between sessions, so it misses the attack.
Read More
Mar 23, 2026
Constraint Dilemma
Trusted monitors can catch backdoors in AI-generated code, but only by flagging a large share of honest submissions for costly human review. We propose enforced comment constraints — mandatory, structured documentation rules that create a dilemma for adversarial models: follow the constraints, which bind comments to actual code semantics, exposing the backdoor to the monitor — or break them and get caught by an automated compliance checker before the monitor even runs. Building on Sreevatsa et al. (2026), who showed that optional comments improve trusted monitoring, we make comments mandatory and structured, eliminating the attacker's ability to write vague or misleading documentation. We introduce a 5-class constraint taxonomy ranging from free AST-based presence checks to LLM-verified semantic accuracy, and evaluate it against six adversarial commenting strategies across 200 coding problems. Our experimental data is synthetic due to compute limitations; the contribution is the constraint dilemma framework and methodology, not the specific numbers.
Read More
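As an illustration only (not the authors' code), the cheapest class in the constraint taxonomy above — a free AST-based presence check — might look like the following sketch, which uses Python's standard `ast` module to verify that every function carries a docstring; the function name and the docstring-as-comment stand-in are hypothetical simplifications:

```python
import ast

def passes_presence_check(source: str) -> bool:
    """Return True if every function in `source` has a docstring.

    A minimal stand-in for the cheapest constraint class: purely
    syntactic presence, with no semantic verification at all.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                return False
    return True

documented = "def add(a, b):\n    '''Return the sum of a and b.'''\n    return a + b\n"
bare = "def add(a, b):\n    return a + b\n"
```

A check like this costs no model calls, which is why semantic-accuracy classes (LLM-verified) sit at the opposite end of the taxonomy.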
Mar 23, 2026
SecureBox: A Layered Control Protocol for Safe AI-Assisted Coding
As the capabilities and applicability of generative AI proliferate, developing control protocols that prevent systemic safety compromise is non-trivial. Existing defenses like prompt-level safety instructions, single-model oversight, and static policies are susceptible to workarounds and fail to catch subtle contextual threats such as unsafe logic, stealthy backdoor insertion, and multi-step threat accumulation under the guise of intelligent prompting. In this project, we develop a security-focused AI coding pipeline designed to mitigate these issues through stratified control mechanisms. Our system combines a prompt refining module that structures and sanitizes user inputs, an advanced security agent that analyzes generated code and commands for potentially harmful behavior, and a benchmarking framework to evaluate the overall pipeline in comparison to popular models. We implement multiple protocols — including baseline, reprompting-only, security-only, and combined approaches — and test them on a curated dataset of both benign and adversarial tasks. Our results show that integrating prompt-level defenses with downstream security analysis can significantly reduce synthesis of unsafe outputs while maintaining strong task performance (e.g., X% reduction in unsafe actions with minimal drop in success rate). Overall, this work demonstrates the value of modular, defense-in-depth designs and provides a reproducible framework for evaluating safety–usefulness tradeoffs in AI-assisted software development.
Read More
Mar 23, 2026
Chiron: Agentic Platform for Automated Setting Design and Red-Blue Protocol Optimization in ControlArena
As AI systems are deployed across increasingly diverse domains, the long tail of risks grows faster than researchers can manually design evaluation scenarios for them. AI control depends on domain-specific settings and protocols, both of which are expensive to create. ControlArena provides a foundational benchmark for building and stress-testing control protocols faster, but today ships with only 14 hand-crafted settings, each requiring days of expert engineering. Moreover, even once a setting exists, choosing a robust protocol for it remains a painful manual process. We present Chiron, an agentic plugin for ControlArena that automates two core bottlenecks: (1) generating diverse and realistic evaluation settings from natural-language domain and threat descriptions, and (2) designing novel control protocols through iterative red-blue optimization. We evaluate Chiron's setting design pipeline through a blinded LLM-judge comparing 5 synthetic settings against 5 hand-crafted ControlArena settings using a novel rubric of 5 axes, finding that synthetic settings achieve equivalent or better design quality (3.83 vs. 3.41 overall, p < 10⁻⁷). We further demonstrate the agentic red-blue protocol co-evolution on the Bash setting, showing that LLM agents can discover novel attack strategies and adaptively strengthen monitoring protocols through multiple rounds of alternating optimization with Pareto constraints. Together, these contributions may provide a modular framework for scaling AI control research operations.
Read More
Mar 23, 2026
Can Language Models Tell When They Are Being Evaluated?
We investigate whether language models can distinguish evaluations from deployment and whether the underlying representations can be steered to counteract sandbagging. Testing 14 models from 8 families on 2,857 forced-choice pairs across 33 thematic pools, we find that all models detect evaluations above chance (56.6–87.0%) with broad cross-model agreement (ρ > 0.7 for the majority of model pairs), and that format (not topic) is the primary cue. We extract two contrastive steering directions on Llama-3.1-8B and Phi-4: a format direction (eval-vs-deploy structure) and a mode direction (sandbagging-vs-honest regime). Combined steering closes 85% of the sandbagging gap on WMDP and transfers to OOD benchmarks (73–89% recovery). On Phi-4, the mode direction is inverted at its critical layer, encoding condition identity rather than capability. Both models from different families exhibit the same pattern: format is a primary confound, the two directions are complementary, and steering is localised despite universal probe separability.
Read More
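For readers unfamiliar with contrastive steering, the standard recipe behind directions like those above is a difference of mean activations between two prompt conditions, added back to the residual stream at inference. This sketch is a generic illustration under that assumption, not the authors' pipeline; the function names and the scaling factor `alpha` are hypothetical:

```python
import numpy as np

def contrastive_direction(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Difference-of-means steering direction between two conditions
    (e.g. eval-formatted vs deploy-formatted prompts), unit-normalised.

    acts_a, acts_b: arrays of shape (n_prompts, hidden_dim) holding
    activations collected at one chosen layer.
    """
    d = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add the scaled direction to a hidden state at the chosen layer."""
    return hidden + alpha * direction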
Mar 23, 2026
Hydra
HYDRA is an adaptive red-teaming framework that stress-tests the safety of language models under realistic, iterative attack. Unlike traditional jailbreak methods that rely on single prompts, HYDRA models an attacker that learns from failure, using structured feedback to refine its strategy over multiple attempts.
Across a benchmark corpus of 50 attack goals spanning domains such as cybersecurity, social engineering, misinformation, and privacy, HYDRA achieves high jailbreak success rates on frontier models, including 88% on Claude Sonnet 4 and 78% on Claude Haiku 4.5, with successful attacks emerging in just 1–2 iterations on average.
Our results show that safety mechanisms do not fail gradually, but instead degrade rapidly under minimal adaptive pressure. HYDRA highlights a critical gap in current evaluation practices and demonstrates the need for iterative, adversarial stress testing to accurately measure real-world model robustness.
Read More
Mar 23, 2026
Loyal
This submission presents a human-in-the-loop authorization protocol for agentic payments, based on the idea that AI agents should be allowed to propose payments but not execute them autonomously in all cases. Inspired by maker-checker controls used in banking, the system separates payment initiation from final approval, applies simple risk rules such as auto-approving small routine transactions and escalating larger or suspicious ones, and packages the relevant invoice context for review through a Telegram interface. The prototype is demonstrated on Solana devnet with normal and adversarial invoice flows, including a hidden-instruction invoice attack, to show how explicit authorization boundaries can reduce unsafe or manipulated payments while preserving convenience for low-risk transactions.
Read More
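The maker-checker routing described above can be sketched in a few lines. This is a hypothetical simplification of the protocol, not the submission's code: the threshold, the `Invoice` fields, and the decision labels are all assumptions made for illustration.

```python
from dataclasses import dataclass

SMALL_PAYMENT_LIMIT = 50.0  # hypothetical auto-approval threshold

@dataclass
class Invoice:
    payee: str
    amount: float
    known_payee: bool  # has this payee been paid before?

def route(invoice: Invoice) -> str:
    """Maker-checker split: the agent *proposes* a payment; this
    decides whether it executes automatically or waits for a human."""
    if invoice.amount <= SMALL_PAYMENT_LIMIT and invoice.known_payee:
        return "auto-approve"       # routine, low-risk transaction
    return "escalate-to-human"      # large or unfamiliar: explicit approval
```

The key property is that the agent never holds unilateral execution authority for high-risk payments — a hidden-instruction invoice can at worst generate a proposal that a human then inspects.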
Mar 23, 2026
Post-hoc Training Interventions Against Naturally Emergent Misalignment
Training-based AI control uses training to mitigate risk from misaligned models, but prior work has only tested this against prompted misalignment. Ryd, Benton & Hebbar (2025) show that SFT on 10–100 samples eliminates prompted attacks on math tasks but recovers only 50–90% of benign performance, and identify RL as the most promising next step without testing it. We extend their work to naturally emergent misalignment — a model that has learned to reward hack through RL on coding tasks — and test both SFT and the RL interventions they propose. We replicate reward hacking emergence on gpt-oss-20b using GRPO on LeetCode problems, porting the training pipeline from veRL to the Tinker cloud API. Hacking emerges by step 25 and saturates at 95% by step 75, but only on gpt-oss-20b — not on Qwen3-8B variants — revealing high sensitivity to model architecture. SFT eliminates hacking completely but catastrophically damages coding ability (0–4% correct rate vs. 62.5% base), confirming Ryd et al.'s capability damage finding on harder misalignment. RL with ground-truth reward recovers 48.4% correct rate with 0% hacking. We test RL with a trusted Qwen3-1.7B monitor (6.9% correct rate) supervising correction of gpt-oss-20b, directly instantiating the scalable oversight question: can a weak trusted model provide sufficient training signal to correct a stronger model's learned misalignment?
Read More
Mar 23, 2026
Red Teaming Multi-Agent Systems in AI Control Evaluations
As AI deployments shift from single models to multi-agent systems, the threat model for AI control must evolve accordingly. Existing control evaluations assume a single untrusted agent; we investigate whether multi-agent architectures give adversaries a material advantage when attempting to subvert safety protocols. We evaluate four multi-agent red-team architectures — Planner-Executor, Parallel Decomposition, Debater-Judge, and Proponent-Critique — on Bash-Bench against standard control protocols (trusted monitoring, defer-to-trusted, trusted editing). Multi-agent red teams consistently shift the safety-usefulness Pareto frontier downward: the best-performing attack architecture (Proponent-Critique) achieves a 55.8% side-task success rate compared to 24% for the single-agent baseline under comparable monitoring, reducing safety to 81% vs. 96% at equivalent thresholds. We also find that granting monitors access to inter-agent communication partially recovers safety (from 81% to 93% when the monitor threshold is tightened from 0.8 to 0.5 under full-transcript monitoring), suggesting multi-agent systems expose new oversight opportunities alongside new risks. We additionally propose a graph-based formalization of multi-agent control that generalizes the single-agent framework, identifying key structural dimensions — coalition topology, trust assignment, and monitor scope — governing the difficulty of control in multi-agent settings.
Read More
Our Impact
Community
Dec 3, 2025
Explaining the Apart Research Fellowships
And introducing our brand new Partnered Fellowships
Read More


Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More



Sign up to stay updated on the
latest news, research, and events