Mar 23, 2026

Tokenization Drift as a Hidden Channel for Alignment Evasion in Language Models

Raul Cavalcante, Bruno Lopes Yamamoto

RL training pipelines that serialize model outputs to text and retokenize them to construct training contexts introduce a token-level asymmetry invisible at the string level. We show this asymmetry is a plausible channel for alignment evasion: a model under opposing training pressures can learn to detect whether it is in deployment or testing and act differently, without any semantic cue or explicit reasoning. When a model's autoregressive output is decoded to a string and re-encoded, the resulting token sequence differs from the original 4–8% of the time across three model families (LFM 1.2B, Llama 1B, Qwen 0.6B). A linear probe on frozen hidden states confirms the model's representations distinguish original (deployment) from retokenized (testing) contexts. Under RL training with a pathwise KL objective, a 1.2B-parameter model learns to assign systematically different likelihoods to the same continuation depending on which token path produced the context, constituting deployment-versus-testing behavioral divergence conditioned on a signal no text-level audit can detect.
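The decode-then-re-encode asymmetry described above can be illustrated with a toy tokenizer. This is a minimal sketch under stated assumptions, not the authors' setup: the vocabulary `VOCAB`, the greedy longest-match `encode`, and the helpers `drifts` and `drift_rate` are all hypothetical names introduced for illustration. The point is only that a sampled token path and the canonical tokenization of the same string can disagree.

```python
# Toy illustration only: a hypothetical vocabulary and a greedy longest-match
# encoder, not the tokenizer of any model named in the abstract.
VOCAB = ["a", "b", "c", "ab", "bc", "abc", " "]


def decode(tokens):
    """Serialize a token sequence to a plain string."""
    return "".join(tokens)


def encode(text):
    """Greedy longest-match tokenization (an assumed canonical encoder).

    Raises ValueError if a character is not covered by VOCAB.
    """
    tokens, i = [], 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens


def drifts(tokens):
    """True if decoding and then re-encoding changes the token sequence."""
    return encode(decode(tokens)) != tokens


def drift_rate(samples):
    """Fraction of sampled token sequences that change under the round trip."""
    return sum(drifts(s) for s in samples) / len(samples)


# A sampler may emit "a" then "b" even though the canonical encoding of the
# string "ab" is the single token "ab"; the first sample below does exactly that.
sampled = [["a", "b", " ", "c"], ["ab", " ", "c"], ["a", " ", "bc"], ["abc"]]
```

In this sketch, `["a", "b", " ", "c"]` and `["ab", " ", "c"]` decode to the same string, so a text-level audit sees no difference, yet only the second survives the round trip unchanged. With a real tokenizer the same check would compare the sampled token IDs against the IDs obtained by re-encoding the decoded string.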

Reviewer's Comments

Seems pretty cool to me. I'm not qualified to give a take here, though, as this isn't really control in the typical sense. This does sound like a very important finding if true: it means that, when doing alignment checks or training, we need to be careful to encode the text in a way that is indistinguishable from what the model will encounter in deployment.

This also seems like a plausibly important consideration when doing honeypotting, which is a common approach that comes up in control (e.g., for untrusted monitoring).

This was quite a novel idea! The premise seems quite feasible and could be the basis for a full paper. The experiments themselves were limited in proving this effect, though this is not that surprising given the limited timespan.

Cite this work

@misc{
  title={(HckPrj) Tokenization Drift as a Hidden Channel for Alignment Evasion in Language Models},
  author={Raul Cavalcante, Bruno Lopes Yamamoto},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.