Mar 21, 2026
Operator Dynamics in Transformer Residual Streams: A Unified Framework for Interpretability, Adversarial Detection, Causal Control, and Topological Model Fingerprinting
Sanskar Pandey
This project proposes a process-level monitoring signal for AI control based on internal transformer dynamics. Instead of relying on outputs or linguistic features, we analyze residual stream operator trajectories, the inter-layer updates that capture how a model computes.
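As a minimal sketch of what an inter-layer update could look like, the snippet below treats the residual stream at each layer as a plain vector and computes the differences between consecutive layers. The names `layer_updates` and `update_norms` are hypothetical, and reading the "operator trajectory" as a sequence of layer-to-layer differences is an assumption based on the description above, not the project's confirmed implementation.

```python
import math
from typing import List

Vector = List[float]


def layer_updates(hidden_states: List[Vector]) -> List[Vector]:
    """Return delta_l = h_{l+1} - h_l for each pair of consecutive layers.

    `hidden_states` holds one residual-stream vector per layer for a single
    token position (an illustrative simplification).
    """
    return [
        [b - a for a, b in zip(h_prev, h_next)]
        for h_prev, h_next in zip(hidden_states, hidden_states[1:])
    ]


def update_norms(hidden_states: List[Vector]) -> List[float]:
    """Summarize a trajectory by the Euclidean norm of each layer update."""
    return [
        math.sqrt(sum(x * x for x in delta))
        for delta in layer_updates(hidden_states)
    ]
```

For a model with L recorded layers this yields L - 1 updates per token position; the per-layer norms are only one possible summary of the trajectory.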
We show that these operator trajectories lie on a low-dimensional structure that can serve as a monitor for unsafe or abnormal computation. Deviations from normal trajectories provide a zero-shot anomaly signal that detects adversarial inputs (GCG suffixes, jailbreaks, and prompt injections) without any adversarial training, with strong performance across the models tested.
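One way such a zero-shot anomaly signal could be computed is sketched below: fit per-feature statistics on benign prompts only, then score a new prompt by how far its trajectory features fall from that baseline. The feature choice (for example, one summary value per layer), the mean squared z-score, and the function names `fit_baseline` and `anomaly_score` are all assumptions for illustration; the scoring rule actually used in the project is not stated here.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_baseline(
    benign: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fit per-feature mean and standard deviation on benign trajectories only.

    Each row of `benign` is one prompt's trajectory features. No adversarial
    examples are used, which is what makes the detector zero-shot.
    """
    columns = list(zip(*benign))
    means = [statistics.fmean(col) for col in columns]
    # Guard against zero variance so the score stays finite.
    stds = [statistics.pstdev(col) or 1e-8 for col in columns]
    return means, stds


def anomaly_score(
    features: Sequence[float], means: Sequence[float], stds: Sequence[float]
) -> float:
    """Mean squared z-score: larger means further from benign behavior."""
    z_squared = [((x - m) / s) ** 2 for x, m, s in zip(features, means, stds)]
    return sum(z_squared) / len(z_squared)
```

This diagonal scoring ignores correlations between features; a full covariance (Mahalanobis) distance or a distance measured in a low-dimensional projection would be natural alternatives under the same assumptions.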
This signal can be integrated into a control protocol as a runtime monitor. A trusted system can flag or intervene when the model’s internal computation deviates from expected patterns, even when the output appears benign. We further show that the same representation enables semantic intervention and neuron-level attribution, suggesting a unified framework for both monitoring and control.
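A runtime monitor of this kind could be wired up roughly as follows: calibrate a threshold on anomaly scores from benign traffic, then flag any request whose score exceeds it, regardless of how the output looks. The names `calibrate_threshold`, `monitor`, and `MonitorDecision`, and the use of an empirical quantile with a target false-positive rate, are hypothetical choices for this sketch rather than the protocol used in the project.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MonitorDecision:
    score: float
    flagged: bool


def calibrate_threshold(
    benign_scores: Sequence[float], target_fpr: float = 0.05
) -> float:
    """Pick a threshold near the (1 - target_fpr) quantile of benign scores."""
    ordered = sorted(benign_scores)
    idx = min(len(ordered) - 1, int((1 - target_fpr) * len(ordered)))
    return ordered[idx]


def monitor(score: float, threshold: float) -> MonitorDecision:
    """Flag a request when its internal anomaly score exceeds the threshold."""
    return MonitorDecision(score=score, flagged=score > threshold)
```

In a control setting, a flagged decision would be routed to the trusted system for review or intervention; the threshold trades missed detections against false alarms and would need calibration per model.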
We validate this approach across four models (GPT-2, Qwen2.5, Mistral-7B, Qwen3.5-9B), demonstrating that internal trajectory-based monitoring generalizes across architectures and scales.
The goal of this work is to shift control from output-level filtering to process-level oversight, where unsafe behavior can be detected and mitigated by monitoring how the model is computing, not just what it produces.
Investigating residual layer differences as a signal for interpretability and monitoring looks promising and deserves more work. The biggest improvement to be made here is to distill the core idea down to 4 pages plus appendix and really zoom in on the strongest application of this idea instead of presenting every possible application in one go. The zero-shot adversarial detection result seems like a good candidate for a really strong contribution and would benefit from deeper validation (larger sample sizes, adaptive adversarial evaluation, comparison to existing hidden-state detection methods, …).
Cite this work
@misc{
  title={(HckPrj) Operator Dynamics in Transformer Residual Streams: A Unified Framework for Interpretability, Adversarial Detection, Causal Control, and Topological Model Fingerprinting},
  author={Sanskar Pandey},
  date={2026-03-21},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


