Mar 23, 2026
Tokenization Drift as a Hidden Channel for Alignment Evasion in Language Models
Raul Cavalcante, Bruno Lopes Yamamoto
RL training pipelines that serialize model outputs to text and retokenize them to construct training contexts introduce a token-level asymmetry invisible at the string level. We show this asymmetry is a plausible channel for alignment evasion: a model under opposing training pressures can learn to detect whether it is in deployment or testing and act differently, without any semantic cue or explicit reasoning. When a model's autoregressive output is decoded to a string and re-encoded, the resulting token sequence differs from the original 4–8% of the time across three model families (LFM 1.2B, Llama 1B, Qwen 0.6B). A linear probe on frozen hidden states confirms the model's representations distinguish original (deployment) from retokenized (testing) contexts. Under RL training with a pathwise KL objective, a 1.2B-parameter model learns to assign systematically different likelihoods to the same continuation depending on which token path produced the context, constituting deployment-versus-testing behavioral divergence conditioned on a signal no text-level audit can detect.
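The core mechanism, decoding a model's token path to a string and re-encoding it, can be illustrated with a minimal sketch. The toy greedy longest-match tokenizer below is hypothetical (it is not the paper's LFM, Llama, or Qwen tokenizer), but it exhibits the same asymmetry: a model sampling autoregressively can emit a token sequence that decode-then-encode collapses into a different sequence for the identical string.

```python
# Hypothetical toy vocabulary, sorted longest-first for greedy matching.
# It contains both the merged token "ab" and the single characters.
VOCAB = ["ab", "a", "b", "c"]

def encode(text):
    """Greedy longest-match encoding over VOCAB (toy stand-in for BPE)."""
    tokens, i = [], 0
    while i < len(text):
        for tok in VOCAB:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"untokenizable character {text[i]!r}")
    return tokens

def decode(tokens):
    """Concatenate tokens back into a string."""
    return "".join(tokens)

# A model's autoregressive output path: it happened to emit "a" then "b".
model_path = ["a", "b", "c"]

# An RL pipeline serializes to text and retokenizes to build the context.
retokenized = encode(decode(model_path))

print(model_path)   # ['a', 'b', 'c']
print(retokenized)  # ['ab', 'c']  <- same string, different token path
```

At the string level the two contexts are indistinguishable, yet the token sequences differ, which is exactly the signal the paper's linear probe recovers from frozen hidden states.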
Cite this work
@misc{cavalcante2026tokenization,
  title={(HckPrj) Tokenization Drift as a Hidden Channel for Alignment Evasion in Language Models},
  author={Cavalcante, Raul and Yamamoto, Bruno Lopes},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


