APART RESEARCH
Impactful AI safety research
Explore our projects, publications and pilot experiments
Our Approach

Our research focuses on critical paradigms in AI Safety. We produce foundational research enabling the safe and beneficial development of advanced AI.

Safe AI
Publishing rigorous empirical work for safe AI: evaluations, interpretability and more
Novel Approaches
Our research is underpinned by novel approaches focused on neglected topics
Pilot Experiments
Apart Sprints have kickstarted hundreds of pilot experiments in AI Safety
Highlights

GPT-4o is capable of complex cyber offense tasks:
We show that state-of-the-art LLMs can complete realistic cyber offense challenges, while open-source models lag behind.
A. Anurin, J. Ng, K. Schaffer, J. Schreiber, E. Kran. Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More

Factual model editing techniques don't edit facts:
Model editing techniques can introduce unwanted side effects in neural networks not detected by existing benchmarks.
J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Konstas, F. Barez. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. ACL 2023
Read More
Research Focus Areas
Multi-Agent Systems
Key Papers:
Comprehensive report on multi-agent risks
Research Index
Nov 18, 2024
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Read More
Nov 2, 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Read More
Oct 18, 2024
Benchmarks
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Read More
Sep 25, 2024
Interpretability
Interpreting Learned Feedback Patterns in Large Language Models
Read More
Feb 23, 2024
Interpretability
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Read More
Feb 4, 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Read More
Jan 14, 2024
Conceptual
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Read More
Jan 3, 2024
Interpretability
Large Language Models Relearn Removed Concepts
Read More
Nov 28, 2023
Interpretability
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Read More
Nov 23, 2023
Interpretability
Understanding Addition in Transformers
Read More
Nov 7, 2023
Interpretability
Locating Cross-Task Sequence Continuation Circuits in Transformers
Read More
Jul 10, 2023
Benchmarks
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Read More
May 5, 2023
Interpretability
Interpreting Language Model Neurons at Scale
Read More
Apart Sprint Pilot Experiments
Nov 24, 2025
TEEs as a Cryptographic Nervous System for Onshored Humanoid Robots
If cheap, capable humanoid robots are deployed at scale and connected to powerful AI models, the central risk is not only “misaligned agents” but losing reliable control over physical actuators. As humanoids move from lab curiosities to industrial and domestic platforms, even a single exploit, backdoor, or bad update could repurpose fleets of robots at once. Recent experiments have shown that robots driven by large language models (LLMs) can be induced, via prompt‑level attacks, to perform dangerous actions despite software‑level safety rules. At the same time, governments and industry are openly planning for mass deployment of humanoids into manufacturing and care settings, with national strategies explicitly framing them as a pillar of future industrial capacity.
We explore an architecture that treats robots as cryptographic devices whose ability to move is gated by trusted execution environments (TEEs), on‑chain registries, and onshored manufacturing. Concretely, we assume a two‑layer control stack: a high‑frequency local controller (potentially a learned policy trained in high‑fidelity simulators such as NVIDIA Isaac Sim) closes the inner loop, while larger vision‑language‑action (VLA) models issue lower‑frequency plans or subgoals. Both layers are attested; actuators only accept commands that originate from code whose measurement matches an approved manifest and that pass through a TEE‑enforced “cryptographic nervous system.” We do not attempt to prove that any given policy is aligned. Instead, we enforce that only attested controller code, using a specific verifiable configuration, can authorize motion—and that only robots listed in a Robot Registry may run models listed in a Model Registry.
As a first step we built a self‑hosted confidential‑compute testbed using dstack (https://vmm.agioh.io/) and a TEE‑backed key‑management service (https://kms.agioh.io/) anchored on the Base chain. On top of this we implemented an attested model runtime (neko_agent), high‑performance WebRTC bindings for low‑latency control, and a deterministic minimal OS (sixos) for reproducible builds. On the hardware side, we designed a dual‑arm humanoid platform that uses forged carbon‑fiber composites and 3D‑printed tooling for rapid, onshore, low‑capex manufacturing that could eventually be partially automated by robots themselves—while remaining under TEE‑enforced control.
Keywords: humanoid robots, trusted execution environments, defensive acceleration, smart contracts, forged carbon fiber, secure remote control
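The gating idea can be sketched as follows; the interface names, registries, and the HMAC standing in for hardware attestation are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of attestation-gated motion commands. In a real system the
# signature would come from a TEE-held key plus a hardware attestation quote;
# an HMAC stands in here. All names are hypothetical.
import hashlib, hmac, json

APPROVED_MEASUREMENTS = {"sha256:controller-build-1"}   # approved controller code
ROBOT_REGISTRY = {"robot-042"}                          # robots allowed to move
MODEL_REGISTRY = {"vla-planner-v1"}                     # models allowed to issue plans

TEE_KEY = b"demo-key"                                   # would never leave the enclave

def sign(command: dict) -> str:
    payload = json.dumps(command, sort_keys=True).encode()
    return hmac.new(TEE_KEY, payload, hashlib.sha256).hexdigest()

def authorize(command: dict, signature: str) -> bool:
    """Check that actuators would run before executing any motion command."""
    if not hmac.compare_digest(sign(command), signature):
        return False                                    # not produced by attested code
    return (command["measurement"] in APPROVED_MEASUREMENTS
            and command["robot_id"] in ROBOT_REGISTRY
            and command["model_id"] in MODEL_REGISTRY)

cmd = {"robot_id": "robot-042", "model_id": "vla-planner-v1",
       "measurement": "sha256:controller-build-1",
       "action": "move_arm", "target": [0.1, 0.2, 0.3]}
print(authorize(cmd, sign(cmd)))                        # True only if every check passes
```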
Read More
Nov 24, 2025
WikiGen: Bio‑logical safeguards for collaborative AI/ML on sensitive data
WikiGen is an open-source protocol that connects databases and a user's own data to enable selective consensus on a given inquiry, drawing on collective and/or private knowledge feeds in the search for data, model improvements, and ultimately cures. WikiGen evaluates a query, protocol, research objective, or even an entire database submission against an internal base of researchers, commercial and academic partners, and healthcare consumers' data repositories, and, in a privacy-preserving way, surfaces either missing components that already exist elsewhere or opportunities for data collection campaigns (similar to bug bounties). Science then does not have to be repeated locally in vain, but can instead be shared to save time and money in R&D globally. Animal model experiments may not be needed if the data already exists or is sufficient to simulate the animal model. Perhaps the lab assay you are about to optimize has already been run under your exact experimental constraints but was never published or made digitally accessible to someone with your credentials. Maybe you are a licensed MD or RN who needs insulin immediately, or wants to know how asthma rates in children are changing in your zip code. If the information has been collected and sits in a database, all you need is a trustless way to request it and to prove that you can be trusted.
Currently, there is no way for researchers to "Google", i.e. BLAST-align, a sequence file (GCAT against its complement CGTA) to locate plasmids or biological materials by percent sequence homology (0-100% match) in databases they have no authority to search; that restriction exists partly for safety. In practice, however, life-science databases are constantly shared, incorrectly logged, mailed around, and collaborated on, alongside carefully catalogued models, benchmarks, scripts, standard operating procedures, and barcoded physical inventory. How do we connect these hard-to-acquire tools and resources, which may hold specific answers to someone else's difficult question, when the data is collected but not shared, reused, or allowed to be open-sourced? If I work at BioMarin, I should not need friends in Amgen's gene therapy group to learn what sample plasmids they might hold in their freezer repertoire for their own IP campaigns, just to determine whom to email for a sample. I would much rather be alerted to what I do not know but need, based on my current requirements.
Researchers and agentic ML/AI pipelines today need new ways to access data, especially sensitive data that could be used for harm, without stunting the benefits that research produces when that data is used for its intended purposes. In antibody discovery, neuroscience, oncology target work, and metabolomics, sequenced library sets are almost never added to open-source gene banks. Biology journals such as Nature have previously put patient identities at risk by publishing single-cell RNA matrix data from tissue sequencing that others were later able to use to re-identify patients. Publishing costs money, and a hostile or confused reviewer can block the channels needed to share important data or protocols in a timely manner. Bioinformaticians and computational biologists routinely complain that code is not published with the paper, or that the wrong files or script versions are made publicly available. Institutions and biopharma corporations are incentivized to support only their own already-financed collaborations, and even where policy and law require sharing, such as the U.S. requirement to report clinical trial results, implementation fails: clinicaltrials.gov has not enforced the upload of missing clinical trial reports into an active registry for permission-approved collective use, and the gaps, going back to 2013, have gone largely unnoticed.
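As a rough illustration of the BLAST-style homology matching mentioned above (a deliberately simplified, ungapped sketch, not WikiGen's actual search protocol):

```python
# Toy percent-identity search over a query and its reverse complement; a
# stand-in for BLAST-style homology matching, not the WikiGen protocol itself.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def percent_identity(a: str, b: str) -> float:
    """Ungapped identity over the shorter of the two sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a[:n], b[:n])) / n

def best_matches(query: str, database: dict[str, str], threshold: float = 90.0):
    hits = []
    for name, seq in database.items():
        score = max(percent_identity(query, seq),
                    percent_identity(reverse_complement(query), seq))
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: -hit[1])

db = {"plasmid_A": "GCATGCATGCAT", "plasmid_B": "TTTTTTTTTTTT"}
print(best_matches("ATGCATGCATGC", db))   # matches plasmid_A via its reverse complement
```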
Read More
Nov 24, 2025
Helix-Aegis: LLM-Based Screening for Bio-Sequences
We introduce Helix-Aegis, a prototype defensive screening system designed to detect hazardous biological sequences (toxins, pathogens, virulence factors) using fine-tuned Large Language Models. While sequence-to-function models are likely to appear in future DNA synthesis screening pipelines, they currently lack safety-aligned reasoning, uncertainty awareness, and risk classification. As such models become more capable, attackers may fine-tune or adversarially manipulate them to generate sequences that evade naive screening. Our goal is to explore whether LLM-based guardrails, analogous to LlamaGuard in text-LLMs, can improve the robustness, interpretability, and safety of protein-sequence screening.
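A minimal sketch of the intended screening contract, with a stub in place of the fine-tuned model and an assumed label taxonomy (the real Helix-Aegis pipeline may differ):

```python
# Uncertainty-aware wrapper around a sequence risk classifier. The classifier
# is a stub standing in for a fine-tuned LLM; labels and thresholds are assumed.
from dataclasses import dataclass

LABELS = ("benign", "toxin", "pathogen_marker", "virulence_factor")

@dataclass
class ScreeningResult:
    label: str
    confidence: float
    decision: str            # "allow", "block", or "flag_for_review"

def stub_classifier(sequence: str) -> dict[str, float]:
    """Placeholder for the fine-tuned model: one probability per label."""
    suspicious = sequence.upper().count("C") / max(len(sequence), 1) > 0.4  # toy heuristic
    return ({"benign": 0.2, "toxin": 0.6, "pathogen_marker": 0.1, "virulence_factor": 0.1}
            if suspicious else
            {"benign": 0.9, "toxin": 0.05, "pathogen_marker": 0.03, "virulence_factor": 0.02})

def screen(sequence: str, block_at: float = 0.8, allow_at: float = 0.5) -> ScreeningResult:
    probs = stub_classifier(sequence)
    label, conf = max(probs.items(), key=lambda kv: kv[1])
    if label != "benign" and conf >= block_at:
        decision = "block"
    elif label == "benign" and conf >= allow_at:
        decision = "allow"
    else:
        decision = "flag_for_review"   # uncertain either way -> human review
    return ScreeningResult(label, conf, decision)

print(screen("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```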
Read More
Nov 24, 2025
Detecting Piecewise Cyber Espionage in Model APIs
On November 13th, 2025, Anthropic published a report on an AI-orchestrated cyber espionage campaign. Threat actors used various techniques to circumvent model safeguards and used Claude Code with agentic scaffolding to automate large parts of their campaigns. Specifically, threat actors split campaigns into subtasks that in isolation appeared benign. It is therefore important to find methods to protect against such attacks to ensure that misuse of AI in the cyber domain can be minimized.
To address this, we propose a novel method for detecting malicious activity in model APIs across piecewise benign requests. We simulated malicious campaigns using an agentic red-team scaffold similar to what was described in the reported attack. We demonstrate that individual attacks are blocked by simple guardrails using Llama Guard 3, and then show that splitting up the attack bypasses the guardrail. For the attacks that get through, we introduce a classifier model that detects them and provide a statistical validation of the detections. This result paves the way to detecting future automated attacks of this kind.
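One way to picture the session-level detection, as a sketch under our own assumptions rather than the paper's exact classifier: score each request with a per-request guard, then classify the whole session from aggregate features, so that a sequence of individually mild requests can still trip the detector.

```python
# Sketch: aggregate per-request guard scores into session-level features and
# train a simple classifier. Scores and data are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def session_features(request_scores: list[float]) -> np.ndarray:
    s = np.asarray(request_scores)       # per-request harm scores in [0, 1]
    return np.array([s.mean(), s.max(), s.std(), float(len(s))])

rng = np.random.default_rng(0)
# Benign sessions vs. split-up attacks where each request stays below a naive
# per-request blocking threshold (say 0.5) but the aggregate pattern differs.
benign  = [rng.uniform(0.0, 0.2, size=rng.integers(5, 15)) for _ in range(200)]
attacks = [rng.uniform(0.2, 0.5, size=rng.integers(5, 15)) for _ in range(200)]
X = np.stack([session_features(list(s)) for s in benign + attacks])
y = np.array([0] * len(benign) + [1] * len(attacks))

clf = LogisticRegression(max_iter=1000).fit(X, y)

session = list(rng.uniform(0.25, 0.45, size=8))   # each request alone looks mild
print(clf.predict_proba(session_features(session).reshape(1, -1))[0, 1])
```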
Read More
Nov 24, 2025
Aegis Sentinel: Multi-Domain Defensive Acceleration Platform for Critical Infrastructure Protection
We present Aegis Sentinel, a defensive acceleration (simulation and analysis) platform addressing the convergence of AI-enabled threats, critical infrastructure vulnerabilities, and cross-domain attack vectors, such as attacks on sovereign oceanic-based datacenters. Our system integrates AI safety validation, biosecurity surveillance, cybersecurity intelligence, and privacy-preserving coordination within a unified framework to anticipate kill chain stages and proactively prepare with research-grounded mitigation steps, where traditional security paradigms prove insufficient.
Read More
Nov 24, 2025
Opening Doors to Multimodal Deception
Organizations increasingly deploy vision-language models (VLMs) locally to protect sensitive data, yet there is almost no empirical work on whether these multimodal systems can systematically lie about visual content or whether text-based safety tools transfer to visual contexts. We present, to our knowledge, the first systematic study of deception detection in VLMs. First, we construct a VLM deception benchmark with 1,048 image–text pairs derived from COCO and show that a standard VLM can be reliably induced to contradict ground-truth visual information, while a linear probe on hidden activations detects these multimodal lies with 98.7% AUROC. Second, we perform a geometric analysis of deception representations across a text-only LLM and a VLM, finding high alignment (cosine similarity > 0.85) between deception directions in the text backbone and in the VLM text stream, suggesting that deception is encoded as a modality-independent concept rather than a purely task-specific feature. Third, we demonstrate cross-modal transfer for deception detection: a probe trained only on text-based deception data attains 78.5% AUROC on visual deception, corresponding to 57% transfer efficiency. These results indicate that text-trained probes can provide useful zero-shot monitoring for local VLM deployments, while also highlighting open questions about modality-specific failure modes and the limits of cross-modal safety transfer.
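The probing setup can be sketched as follows; synthetic activations stand in for real VLM hidden states, so the numbers will not reproduce the reported results.

```python
# Linear deception probe on hidden activations, plus the cosine comparison of
# "deception directions" between models. Data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 512
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)                 # ground-truth "deception direction"

honest = rng.normal(size=(500, d))
lying = rng.normal(size=(500, d)) + 1.5 * direction    # deceptive samples shifted along it
X = np.vstack([honest, lying])
y = np.array([0] * 500 + [1] * 500)

perm = rng.permutation(len(X))                         # shuffle before the train/test split
X, y = X[perm], y[perm]

probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("AUROC:", roc_auc_score(y[800:], probe.decision_function(X[800:])))

# Alignment between the learned probe direction and another model's direction
# (here, the generating direction) is a plain cosine similarity:
w = probe.coef_[0]
print("cosine:", w @ direction / np.linalg.norm(w))
```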
Read More
Nov 24, 2025
Gene Guard: Real-Time Genomic Data Leak Prevention
Genomic data (such as DNA sequences) are both highly sensitive and uniquely vulnerable to accidental leakage in modern collaborative and AI-assisted workflows. This report introduces Gene-Guard, a two-layer detection and prevention system designed to safeguard genomic sequences from unauthorized exposure in real time. We articulate the growing risk through real-world scenarios: researchers inadvertently pasting DNA sequences into chat platforms or AI tools, and bio-lab staff emailing unencrypted gene data. To address this gap, Gene-Guard integrates an open DNA sequence dataset for training and benchmarking machine learning classifiers that flag genomic data with high recall and precision. Upon detection, an optional second layer provides an LLM-powered risk analysis, identifying the sequence and its potential biosecurity implications using tools like IBBIS’s Common Mechanism and SeqScreen. Emphasis is placed on a lightweight, fast, platform-agnostic tool that can plug into any Data Loss Prevention pipeline without heavy resource demands or latency. Our results suggest Gene-Guard can significantly reduce the chance of genomic data leaks without impeding researchers’ workflows, offering a novel safety net at the intersection of cyber and biosecurity.
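The lightweight first layer can be approximated with a pattern check that runs before any heavier analysis; the regex and thresholds below are illustrative, not Gene-Guard's actual classifier.

```python
# First-pass detector: flag FASTA headers or long runs of nucleotide characters
# in outgoing text, then escalate to an ML classifier or LLM-based risk analysis.
import re

DNA_RUN = re.compile(r"[ACGTNacgtn]{60,}")      # 60+ contiguous nucleotide characters
FASTA_HEADER = re.compile(r"^>\S+", re.MULTILINE)

def looks_like_genomic_data(text: str) -> bool:
    if FASTA_HEADER.search(text):
        return True
    for run in DNA_RUN.findall(text):
        # Require all four bases so repeated padding characters are not flagged.
        if all(base in run.upper() for base in "ACGT"):
            return True
    return False

outgoing = ("hi team, quick question about this clone:\n"
            "ATGGCGTACGTTAGCTAGCTAGGCTAACGTTAGCGTACGATCGATCGTAGCTAGCTAGGCA")
if looks_like_genomic_data(outgoing):
    print("Possible genomic sequence detected; block and escalate to layer 2.")
```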
Read More
Nov 24, 2025
TinyRod
Phishing is still the easiest way into most organisations, despite years of machine-learning filters boasting 95%+ accuracy on benchmark datasets. Those models usually live in centralised paid services that demand full access to email content, create single points of failure, and still struggle with fast-moving, AI-generated content and landing pages.
TinyRod is a proof-of-concept email security tool that asks a simple question: how far can we push phishing detection if everything runs locally in the user’s browser? The project couples a Chrome extension with small open-source language models running via WebLLM, so email content never leaves the browser. When a user clicks “scan”, TinyRod pulls structured metadata from the page, feeds it to an SLM, and returns a 0–100 risk score plus a small, fixed list of “red flag” reasons. That keeps the experience close to how people already use email.
The result is a hackathon-built but realistic architecture that shows SLMs in the ~1.7B–8B range can meaningfully support phishing defence at the edge, with clear privacy trade-offs, a documented threat model, and an obvious path for hardening and future work.
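The scoring contract can be sketched as follows; Python stands in for the extension's in-browser implementation, and the prompt, output schema, and red-flag list are our assumptions rather than TinyRod's exact ones.

```python
# Sketch of the scan contract: structured email metadata in, a clamped 0-100
# risk score and a fixed list of red flags out. The model call is stubbed; the
# real tool runs a small open model in the browser via WebLLM.
import json

RED_FLAGS = ["sender_mismatch", "urgent_language", "suspicious_link",
             "credential_request", "unexpected_attachment"]

def build_prompt(meta: dict) -> str:
    return ("Given this email metadata, return JSON with 'risk_score' (0-100) "
            f"and 'red_flags' (a subset of {RED_FLAGS}).\n" + json.dumps(meta))

def stub_slm(prompt: str) -> str:
    # Placeholder for the small language model's reply.
    return json.dumps({"risk_score": 78, "red_flags": ["sender_mismatch", "suspicious_link"]})

def scan(meta: dict) -> dict:
    try:
        out = json.loads(stub_slm(build_prompt(meta)))
    except json.JSONDecodeError:
        out = {}                                          # malformed reply -> uncertain
    score = max(0, min(100, int(out.get("risk_score", 50))))
    flags = [f for f in out.get("red_flags", []) if f in RED_FLAGS]
    return {"risk_score": score, "red_flags": flags}

print(scan({"from": "it-support@examp1e.com", "reply_to": "attacker@evil.test",
            "subject": "URGENT: password reset required",
            "links": ["http://examp1e.com/reset"]}))
```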
Read More
Nov 24, 2025
Zero Trust Agency: Mitigating the Confused Deputy in Autonomous AI Systems
This project addresses the critical "Confused Deputy" vulnerability in Agentic AI, where agents operating with static "God-mode" permissions are hijacked via prompt injection to attack enterprise systems. We developed a Zero Trust Architecture that replaces these static credentials with Identity Propagation via RFC 8693 (Token Exchange). This compels agents to cryptographically "borrow" the human user's identity ("On-Behalf-Of") for every interaction, ensuring they can only access resources explicitly authorized for that user. Our MVP simulation demonstrated that this architecture deterministically blocks 100% of privilege escalation and lateral movement attempts by enforcing hard permission boundaries at the infrastructure layer, rendering prompt injection attacks ineffective against unauthorized tools.
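A minimal sketch of the RFC 8693 exchange; the endpoint, client credentials, and audience are placeholders rather than a real deployment.

```python
# RFC 8693 token exchange: the agent presents the user's token as subject_token
# and receives a short-lived token scoped to that user and a specific audience.
import requests

TOKEN_ENDPOINT = "https://idp.example.com/oauth2/token"   # placeholder IdP

def exchange_on_behalf_of(user_access_token: str, audience: str) -> str:
    resp = requests.post(
        TOKEN_ENDPOINT,
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": user_access_token,
            "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
            "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
            "audience": audience,                          # e.g. the downstream tool's API
        },
        auth=("agent-client-id", "agent-client-secret"),   # the agent's own identity
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

# Every tool call uses a token derived from the human user's identity, so the
# agent never holds a static "God-mode" credential:
# tool_token = exchange_on_behalf_of(user_token, audience="https://crm.example.com")
```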
Read More
Our Impact
Research
Jul 25, 2025
Problem Areas in Physics and AI Safety
We outline five key problem areas in AI safety for the AI Safety x Physics hackathon.
Read More


Newsletter
Jul 11, 2025
Apart: Two Days Left of our Fundraiser!
Last call to be part of the community that contributed when it truly counted
Read More


Newsletter
Jun 17, 2025
Apart: Fundraiser Extended!
We've received another $462,276 since our last newsletter, making the total $597k of our $955k goal!
Read More



Sign up to stay updated on the latest news, research, and events
