Feb 2, 2026

Self-Governance Under Revision

Daniel Polak, Ajay Agarwal

Frontier AI labs have published voluntary safety frameworks committing to evaluate dangerous capabilities and implement safeguards before deployment. These documents, including Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework, are often cited as evidence of responsible self-governance, yet can be revised at any time without external approval.

We developed a taxonomy of commitment changes and applied it to over 60 revisions across three major labs, coding direction (strengthening, weakening, neutral), mechanism, and changelog disclosure. We find sharp divergence: OpenAI weakened 19 commitments with zero strengthenings; Anthropic was roughly balanced (11 weakenings, 10 strengthenings); and DeepMind net strengthened (4 weakenings, 11 strengthenings). Strengthenings were nearly twice as likely to be omitted from changelogs as weakenings (68% vs. 38%), and all three labs weakened evaluation-frequency commitments. Most weakenings reduced oversight-relevant properties: 68% affected external accountability and 56% reduced measurability.
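The per-lab figures above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts and percentages reported in this abstract; the variable and function names are illustrative and do not come from the project's own code.

```python
# Counts as reported in the abstract; neutral changes are not broken out there.
REPORTED = {
    "OpenAI": {"weakening": 19, "strengthening": 0},
    "Anthropic": {"weakening": 11, "strengthening": 10},
    "Google DeepMind": {"weakening": 4, "strengthening": 11},
}


def net_direction(counts):
    """Strengthenings minus weakenings; a negative value means net weakening."""
    return counts["strengthening"] - counts["weakening"]


def omission_ratio(strengthening_rate, weakening_rate):
    """How many times more often strengthenings are left out of changelogs."""
    return strengthening_rate / weakening_rate


net = {lab: net_direction(counts) for lab, counts in REPORTED.items()}
directional_total = sum(
    counts["weakening"] + counts["strengthening"] for counts in REPORTED.values()
)
ratio = omission_ratio(0.68, 0.38)  # omission rates reported in the abstract

if __name__ == "__main__":
    for lab, value in net.items():
        print(f"{lab}: net {value:+d}")
    print(f"directional changes: {directional_total}")
    print(f"omission ratio: {ratio:.2f}")
```

The directional counts sum to 55, so the remainder of the 60-plus coded revisions would be the neutral changes, assuming every revision received exactly one direction label.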

Overall, framework evolution varies substantially by lab, while official communications systematically underreport commitment reductions. Our taxonomy and dataset enable ongoing monitoring and inform policymakers on which commitments may require regulatory protection rather than voluntary maintenance.

Reviewer's Comments

I like what this paper is doing: treating safety frameworks like version-controlled code and actually diffing them. That's the right instinct. The OpenAI numbers (19 weakenings, 0 strengthenings) jump off the page, and the universal weakening of evaluation frequency across all three labs is the kind of finding that should make regulators nervous.

The biggest weakness is that one person coded all 62 changes. Classification like this is judgment-heavy ("is this a weakening or just a clarification?"), and without a second coder checking at least a sample, readers have to take the authors' word for it. That's a fixable problem, and fixing it would go a long way.
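One common way to quantify agreement between two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an assumption about how such a check might look, not something the paper reports; the sample labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Share of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for a six-change sample; not data from the paper.
first_coder = ["weakening", "weakening", "neutral", "strengthening", "weakening", "neutral"]
second_coder = ["weakening", "neutral", "neutral", "strengthening", "weakening", "neutral"]
kappa = cohens_kappa(first_coder, second_coder)

if __name__ == "__main__":
    print(f"kappa = {kappa:.3f}")
```

In this made-up sample the coders agree on five of six items, yet kappa comes out lower than the raw agreement rate because some of that agreement would be expected by chance.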

There's also a weird result buried in the data: labs *omit* their own improvements from changelogs more often than they omit weakenings. Why? Are they sandbagging their own good news? Is it just that positive changes feel less "newsworthy" internally? The paper flags this but doesn't dig into it, and it's maybe the most interesting thread to pull on.

One last thing: the paper tells us what's broken but doesn't really propose a fix. A concrete spec for a machine-readable changelog format (even a rough draft) would turn this from "here's a problem" into "here's a problem and here's what to do about it."
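As a purely illustrative starting point, a machine-readable changelog entry might carry the same dimensions the paper codes (direction, mechanism, changelog disclosure) plus enough metadata to locate the change. Every field name and value below is hypothetical; this is a rough sketch under those assumptions, not a format proposed by the authors or used by any lab.

```python
import json
from dataclasses import asdict, dataclass

# Direction labels taken from the paper's taxonomy.
DIRECTIONS = {"strengthening", "weakening", "neutral"}


@dataclass(frozen=True)
class ChangeRecord:
    """One commitment change between two framework versions (hypothetical schema)."""

    lab: str
    framework: str
    from_version: str
    to_version: str
    commitment: str
    direction: str
    mechanism: str
    disclosed_in_changelog: bool

    def __post_init__(self):
        # Reject labels outside the three-way direction coding.
        if self.direction not in DIRECTIONS:
            raise ValueError(f"unknown direction: {self.direction!r}")


def to_json(records):
    """Serialize records to a stable, diff-friendly JSON document."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


# Placeholder values for illustration only; not drawn from any real framework.
example = ChangeRecord(
    lab="ExampleLab",
    framework="Example Safety Framework",
    from_version="1.0",
    to_version="1.1",
    commitment="evaluation frequency",
    direction="weakening",
    mechanism="placeholder mechanism label",
    disclosed_in_changelog=False,
)
serialized = to_json([example])

if __name__ == "__main__":
    print(serialized)
```

Sorting keys and using fixed indentation keeps the output stable across runs, so successive changelog files can themselves be diffed.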

A simple idea, executed and presented well.

Cite this work

@misc{
  title={(HckPrj) Self-Governance Under Revision},
  author={Daniel Polak, Ajay Agarwal},
  date={2/2/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100% detection, 0% FPR, and perplexity 4.20. Robustness testing confirms detection above 96% even with 30% word replacement. The framework enables O(n) model-free detection, addressing EU AI Act Article 50 requirements. Code available at https://github.com/ChenghengLi/MCLW

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification cannot close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Feb 2, 2026