Mar 23, 2026
Denoising Trajectories as Early-Warning Signals for Diffusion Language Model Control
Seunghyun Yoo
We study whether denoising trajectories in diffusion language models can serve as early warning signals for AI control. Using LLaDA with pretrained sparse autoencoders in an executable shell setting, we find trajectory-based monitoring detects attacks much earlier than output-only oversight, while steering shows weaker but preliminary promise for localized intervention.
Using intermediate representations for monitoring, and steering for intervention, are both potentially interesting for control. Applying diffusion LMs seems somewhat novel to me; however, it's difficult to see how this research is useful unless frontier models are also diffusion LMs.
This would have been stronger if run in a control environment like BigCodeBench or APPS (both are single-turn). It's also hard to tell whether the SAEs are actually classifying scheming or merely the input prompt. Additionally, I didn't see many statistical tests or p-values, and it's unclear why safety is unchanged with an output-only monitor versus no monitor. Addressing both would have made this paper stronger.
Cite this work
@misc{yoo2026denoising,
  title={(HckPrj) Denoising Trajectories as Early-Warning Signals for Diffusion Language Model Control},
  author={Seunghyun Yoo},
  year={2026},
  month={March},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}