Mar 22, 2026

When Safety Becomes the Vulnerability

Caleb Rudnick, August Lina

Any base64-encoded string included in a message to the Claude API causes that request to fail. This behaviour is reproducible across Sonnet and Opus 4.6, and across all platforms including the API, web and mobile applications, and Claude Code. While the blanket rejection of base64 content was originally a reasonable defence against prompt injection—attackers could encode malicious instructions to bypass keyword-based safety filters—the measure has become a liability as large language models have moved from conversational assistants to critical infrastructure components.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

the report is very well written and presented to the level of a good conference paper.

im a bit skeptical of the threat model (base64-injection attacks causing critical system failures due to hitting refusals from the model provider). I feel like systems should be basically robust to model outages and base64 attacks might be a specific cause of outages. My guess is there might be narrow situations in which an attack can strategically cause a model outage leading to a security vulnerability, but that this isn't a super important threat vector.

FYI I tried running a simple example from the report (aGVsbG8gd29ybGQ= “hello world”) in claude.ai and it did not trigger a refusal (the model just understood the message and responded appropriately).

I agree with the broad dynamic:

>This is a concrete instance of a broader principle in AI control research: that safety mechanisms which

fail catastrophically (by crashing the system or returning no response) rather than degrading

gracefully (by flagging suspicious content and continuing) can be weaponised by adversaries

who understand the failure mode.

But I think this type of threat should just be caught by standard testing and security protocols and isn't a major threat model. I think it's good the report points out the dynamic.

Cite this work

@misc {

title={

(HckPrj) When Safety Becomes the Vulnerability

},

author={

Caleb Rudnick, August Lina

},

date={

3/22/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.