Jun 13, 2025

Four Paths to Failure: Red Teaming ASI Governance

Luca De Leo, Zoé Roy-Stang, Heramb Podar, Damin Curtis, Vishakha Agrawal, Ben Smyth

We stress‑tested A Narrow Path Phase 0—the proposed 20‑year moratorium on training artificial super‑intelligence (ASI)—during a one‑day red‑teaming hackathon. Drawing on rapid literature reviews, historical analogues (nuclear, bioweapon, cryptography, and export‑control regimes), and rough‑order cost modelling, we examined four cornerstone safeguards: datacentre compute caps, training‑licence thresholds, use‑ban triggers, and 12‑day breakout‑time detection logic.

Our analysis surfaced four realistic circumvention routes that an adversary could pursue while remaining nominally compliant:

Jurisdictional arbitrage via non‑signatory “AI‑haven” states.

Distributed consumer‑GPU swarms operating below per‑node licence limits.

Covert runs on classified state supercomputers hidden behind national‑security carve‑outs.

Offshore proxy datacentres protected by host‑state sovereignty and inspection vetoes.

Each path could power GPT‑4‑scale training within 6–18 months, undermining the moratorium’s objectives.
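
To give a sense of how the 6–18 month figure can arise, the back-of-envelope sketch below estimates how long a distributed consumer-GPU swarm would need to accumulate GPT‑4‑scale training compute. The specific figures (roughly 2e25 FLOP of total training compute, roughly 150 TFLOP/s per consumer GPU, 25% effective utilization over commodity networking) are illustrative assumptions for this sketch, not the rough-order cost model used in the report.

# Rough, illustrative back-of-envelope estimate: time for a distributed
# consumer-GPU swarm to accumulate GPT-4-scale training compute.
# All constants below are assumptions for illustration only.

GPT4_SCALE_FLOP = 2e25        # assumed total compute for a GPT-4-scale run
GPU_PEAK_FLOPS = 1.5e14       # assumed throughput of one consumer GPU (FLOP/s)
UTILIZATION = 0.25            # assumed effective utilization over a slow network
SECONDS_PER_MONTH = 30 * 24 * 3600

def months_to_train(num_gpus: int) -> float:
    """Months for `num_gpus` consumer GPUs to accumulate GPT-4-scale compute."""
    effective_flops = num_gpus * GPU_PEAK_FLOPS * UTILIZATION
    return GPT4_SCALE_FLOP / effective_flops / SECONDS_PER_MONTH

for swarm_size in (10_000, 20_000, 30_000):
    print(f"{swarm_size:>6} GPUs -> ~{months_to_train(swarm_size):.1f} months")

Under these assumptions, a swarm of roughly 20,000–30,000 consumer GPUs, each individually below any per-node licence limit, lands inside the 6–18 month window cited above.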

To close these gaps, we propose ten mutually reinforcing amendments, including universal cryptographic “compute passports,” quarterly‑updating compute thresholds, an International AI Safety Commission with secondary‑sanctions powers, HSM‑bound weights, kill‑switch performance bonds, whistle‑blower bounties, whole‑network swarm detection, conditional infrastructure aid, and a 30‑month sunset‑and‑iteration clause. Implemented as a single package, these measures transform Phase 0 from a jurisdiction‑bounded freeze into a verifiable, adaptive global regime that blocks every breach path identified while permitting licensed low‑risk research.
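
To make the proposed "compute passports" concrete, the sketch below shows one possible shape for such a check at a datacentre or cluster gateway: the accelerator vendor signs a binding between a chip's serial number and a licensed operator, and the gateway admits only devices whose attestation verifies and whose serial is not on a revocation list. The Ed25519 signatures, field names, and revocation list are illustrative assumptions rather than part of the Phase 0 proposal.

# Minimal sketch of a hypothetical "compute passport" check (illustrative only).
# Requires the `cryptography` package.
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

manufacturer_key = Ed25519PrivateKey.generate()   # held by the chip vendor
manufacturer_pub = manufacturer_key.public_key()  # distributed to verifiers

def issue_passport(serial: str, licensee: str) -> dict:
    """Vendor signs the (serial, licensee) binding at point of sale."""
    payload = json.dumps({"serial": serial, "licensee": licensee}).encode()
    return {"payload": payload, "signature": manufacturer_key.sign(payload)}

REVOKED_SERIALS = {"GPU-0002"}  # e.g. reported stolen or diverted

def admit_to_cluster(passport: dict) -> bool:
    """Gateway check: valid vendor signature and serial not revoked."""
    try:
        manufacturer_pub.verify(passport["signature"], passport["payload"])
    except InvalidSignature:
        return False
    serial = json.loads(passport["payload"])["serial"]
    return serial not in REVOKED_SERIALS

print(admit_to_cluster(issue_passport("GPU-0001", "Licensed Lab A")))  # True
print(admit_to_cluster(issue_passport("GPU-0002", "Licensed Lab A")))  # False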

Reviewer's Comments

Tolga Bilge

Great paper! I have reservations about adding sunset provisions.

Dave Kasten

First off, I appreciate the explicit choice to say "We deliberately favoured substance critiques—those that treat the policies as flawlessly implemented—and set aside feasibility or implementation questions". This is the heart of the exercise! Thank you.

The discussion of regulatory flight is particularly thoughtful, especially with regard to the specific arguments about timeframes for moving to other countries and the examples (we were already familiar with the Soviet bioweapons and Pakistani nuclear program examples and think they are some of the strongest bits of evidence against A Narrow Path, but thinking of FTX as an example of this is particularly inspired).

I think the datacentre discussion is also good, though it doesn't wrestle with the core question readers are left with: why aren't players in the space doing this already, and how quickly would the optimization pressure of our policies move the companies to overcome the (real but not insurmountable) technical hurdles? The compute passports idea in this context is intriguing, however. The discussion of other ways to identify sufficient criminal suspicion for a search (e.g., power usage or heat emissions, which is in line with how many jurisdictions detect unlawful marijuana grow houses today) is also thoughtful and better articulated than most examples I've seen.

The offshore discussion is good, though it immediately made me think of examples of rendition during the US Global War on Terror that I wish you had brought in as well.

The 10 ideas are quite good as well; some of these mirror our proposals (e.g., the IASC inspections) but the specifics of how to make them work are just really sharp.

Overall, this hits the mark of giving me (1) specific additional things I need to think about in order to find things to patch, and (2) specific additional sentences, driven by examples, that I can already use in discussions with policymakers to answer a "how might this work" question.

Cite this work

@misc{
  title={(HckPrj) Four Paths to Failure: Red Teaming ASI Governance},
  author={Luca De Leo, Zoé Roy-Stang, Heramb Podar, Damin Curtis, Vishakha Agrawal, Ben Smyth},
  date={6/13/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.