Mar 21, 2026
Operator Dynamics in Transformer Residual Streams: A Unified Framework for Interpretability, Adversarial Detection, Causal Control, and Topological Model Fingerprinting
Sanskar Pandey
This project proposes a process-level monitoring signal for AI control based on internal transformer dynamics. Instead of relying on outputs or linguistic features, we analyze residual stream operator trajectories: the sequence of inter-layer updates that captures how a model computes.
We show that these operator trajectories form a low-dimensional structure that can be used as a monitor for unsafe or abnormal computation. Deviations from normal trajectories provide a zero-shot anomaly signal that detects adversarial inputs (GCG suffixes, jailbreaks, prompt injections) without requiring any adversarial training, achieving strong performance across multiple models.
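As a rough illustration of the idea in the paragraph above, the sketch below treats an operator trajectory as the sequence of differences between consecutive layers' residual-stream states, fits a low-dimensional subspace to benign trajectories, and scores a new input by how far its trajectory falls outside that subspace. This is a minimal sketch under stated assumptions, not the submission's actual method: the PCA-residual score, the function names, and the synthetic hidden states are all hypothetical stand-ins, and a real pipeline would read hidden states from a model.

```python
import numpy as np

def operator_trajectory(hidden_states):
    """Flattened inter-layer updates h[l+1] - h[l].

    hidden_states: array of shape (n_layers + 1, d_model) for one position.
    (Assumed representation; the source does not specify one.)
    """
    h = np.asarray(hidden_states, dtype=float)
    return (h[1:] - h[:-1]).reshape(-1)

def fit_benign_subspace(trajectories, k):
    """Fit a k-dimensional PCA subspace to benign trajectories."""
    X = np.stack(trajectories)
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def anomaly_score(trajectory, mean, basis):
    """Distance from the trajectory to the benign subspace."""
    centered = trajectory - mean
    projected = basis.T @ (basis @ centered)
    return float(np.linalg.norm(centered - projected))

# --- Synthetic demonstration (no real model involved) ---
rng = np.random.default_rng(0)
n_layers, d_model, k = 4, 8, 3
true_basis = rng.normal(size=(k, n_layers * d_model))

def synthetic_hidden_states(updates_flat):
    """Build hidden states whose layer-to-layer differences equal the updates."""
    updates = updates_flat.reshape(n_layers, d_model)
    return np.vstack([np.zeros((1, d_model)), np.cumsum(updates, axis=0)])

def benign_sample():
    """Benign updates lie near a k-dimensional subspace, plus small noise."""
    updates = rng.normal(size=k) @ true_basis
    updates = updates + 0.01 * rng.normal(size=n_layers * d_model)
    return operator_trajectory(synthetic_hidden_states(updates))

train = [benign_sample() for _ in range(50)]
mean, basis = fit_benign_subspace(train, k)

benign_scores = [anomaly_score(benign_sample(), mean, basis) for _ in range(20)]

# An "abnormal" trajectory: updates that ignore the benign subspace.
odd = operator_trajectory(
    synthetic_hidden_states(rng.normal(size=n_layers * d_model))
)
odd_score = anomaly_score(odd, mean, basis)
```

In this toy setup the off-subspace trajectory scores well above the held-out benign ones, which is the zero-shot property the abstract describes: only benign data is used to fit the monitor.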
This signal can be integrated into a control protocol as a runtime monitor. A trusted system can flag or intervene when the model’s internal computation deviates from expected patterns, even when the output appears benign. We further show that the same representation enables semantic intervention and neuron-level attribution, suggesting a unified framework for both monitoring and control.
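One way such a runtime monitor might be wired up is sketched below: a threshold is calibrated on anomaly scores from benign inputs, and each new score is mapped to an allow or flag decision. This is an assumption-laden sketch; the quantile-based calibration, the `false_positive_rate` parameter, and both function names are hypothetical, and the source does not say how its protocol sets thresholds or intervenes.

```python
import numpy as np

def calibrate_threshold(benign_scores, false_positive_rate=0.01):
    """Pick the score above which roughly `false_positive_rate` of benign inputs fall."""
    return float(np.quantile(benign_scores, 1.0 - false_positive_rate))

def monitor_decision(score, threshold):
    """Flag a computation whose anomaly score exceeds the calibrated threshold."""
    return "flag" if score > threshold else "allow"

# Hypothetical benign anomaly scores, used only for calibration.
calibration_scores = np.linspace(0.0, 1.0, 101)
threshold = calibrate_threshold(calibration_scores, false_positive_rate=0.01)

typical = monitor_decision(0.5, threshold)   # within the benign range
unusual = monitor_decision(2.0, threshold)   # far above the benign range
```

Because the decision depends only on the internal score, it can fire even when the generated text looks benign; what the trusted system does after a flag (block, resample, escalate) is left open here.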
We validate this approach across four models (GPT-2, Qwen2.5, Mistral-7B, Qwen3.5-9B), demonstrating that internal trajectory-based monitoring generalizes across architectures and scales.
The goal of this work is to shift control from output-level filtering to process-level oversight, where unsafe behavior can be detected and mitigated by monitoring how the model is computing, not just what it produces.
Cite this work
@misc {
  title={(HckPrj) Operator Dynamics in Transformer Residual Streams: A Unified Framework for Interpretability, Adversarial Detection, Causal Control, and Topological Model Fingerprinting},
  author={Sanskar Pandey},
  date={3/21/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


