Mar 23, 2026
Tokenization Drift as a Hidden Channel for Alignment Evasion in Language Models
Raul Cavalcante, Bruno Lopes Yamamoto
RL training pipelines that serialize model outputs to text and retokenize them to construct training contexts introduce a token-level asymmetry invisible at the string level. We show this asymmetry is a plausible channel for alignment evasion: a model under opposing training pressures can learn to detect whether it is in deployment or testing and act differently, without any semantic cue or explicit reasoning. When a model's autoregressive output is decoded to a string and re-encoded, the resulting token sequence differs from the original 4–8% of the time across three model families (LFM 1.2B, Llama 1B, Qwen 0.6B). A linear probe on frozen hidden states confirms the model's representations distinguish original (deployment) from retokenized (testing) contexts. Under RL training with a pathwise KL objective, a 1.2B-parameter model learns to assign systematically different likelihoods to the same continuation depending on which token path produced the context, constituting deployment-versus-testing behavioral divergence conditioned on a signal no text-level audit can detect.
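The core mechanism can be illustrated in a few lines. Below is a minimal, self-contained sketch using a hypothetical greedy longest-match tokenizer (not the actual LFM/Llama/Qwen tokenizers studied in the paper): a model can autoregressively emit a non-canonical token path whose decoded string re-encodes to a different, canonical path, even though the two paths are identical at the string level.

```python
# Toy illustration of tokenization drift. A model samples a token path;
# an RL pipeline decodes it to a string and re-encodes it, producing a
# different (canonical) token sequence for the identical text.
# The vocabulary and tokenizer here are illustrative assumptions.

VOCAB = ["ab", "a", "b", "c"]  # toy vocabulary

def encode(text):
    """Greedy longest-match encoding: the canonical token path."""
    tokens = []
    i = 0
    while i < len(text):
        for tok in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"untokenizable character at position {i}")
    return tokens

def decode(tokens):
    return "".join(tokens)

# Suppose the model autoregressively sampled this non-canonical path:
sampled = ["a", "b", "c"]            # original (deployment) path
retokenized = encode(decode(sampled))  # path the training pipeline rebuilds

print(sampled)      # ['a', 'b', 'c']
print(retokenized)  # ['ab', 'c']
print(decode(sampled) == decode(retokenized))  # True: same string
```

Because the two paths decode to the same string, no text-level audit can tell them apart, yet the model conditions on different token sequences in deployment versus retraining/testing contexts, which is exactly the asymmetry the probe and RL experiments exploit.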
Seems pretty cool to me. I'm not qualified to give a strong take here, though, as this isn't really control in the typical sense. This does sound like a very important finding if true: it means that when doing alignment checks or training, we need to be careful to encode the text in a way that is indistinguishable from what the model will encounter in deployment.
This also seems like a plausibly important consideration for honeypotting, which is a common approach in control (e.g., for untrusted monitoring).
This was quite a novel idea! The premise seems feasible and could be the basis for a full paper. The experiments themselves were limited in demonstrating the effect, though this is not surprising given the limited timespan of the sprint.
Cite this work
@misc{
  title={(HckPrj) Tokenization Drift as a Hidden Channel for Alignment Evasion in Language Models},
  author={Raul Cavalcante and Bruno Lopes Yamamoto},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}