Summary
Recent results in Computational Mechanics show that Transformer models trained on sequences generated by Hidden Markov Models (HMMs) develop a belief state geometry in their residual streams, which they use to track the expected state of the HMM in order to predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the models’ activations to define the belief states. We also used a lower-dimensional space that encodes the theoretical belief state geometry of a given HMM, and show that while both methods allow us to steer the models’ behavior, the difference-in-means approach is more robust.
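As a rough illustration of the difference-in-means idea mentioned above, the following is a minimal sketch and not the project's actual code. It assumes that activations collected under two belief states are available as plain lists of vectors. The synthetic 4-dimensional data, the function names, and the scaling parameter `alpha` are all hypothetical choices made for this example.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_in_means(source_acts, target_acts):
    """Steering vector pointing from the source-state mean to the target-state mean."""
    mu_src = mean_vector(source_acts)
    mu_tgt = mean_vector(target_acts)
    return [t - s for s, t in zip(mu_src, mu_tgt)]


def steer(activation, steering_vector, alpha=1.0):
    """Add the scaled steering vector to a single activation vector."""
    return [a + alpha * v for a, v in zip(activation, steering_vector)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sample(center, rng, n=200, noise=0.1):
    """Hypothetical stand-in for activations: Gaussian noise around a center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


# Synthetic data (assumption): two belief states cluster around different centers.
rng = random.Random(0)
acts_a = sample([1.0, 0.0, 0.0, 0.0], rng)
acts_b = sample([0.0, 1.0, 0.0, 0.0], rng)

# Build the steering vector and apply it to one activation from state A.
v = difference_in_means(acts_a, acts_b)
x = acts_a[0]
x_steered = steer(x, v)

# After steering, the activation should lie closer to the mean of state B.
mu_b = mean_vector(acts_b)
moved_closer = sq_dist(x_steered, mu_b) < sq_dist(x, mu_b)
print("moved closer to target state:", moved_closer)
```

In a real model, the vectors would be residual-stream activations at a chosen layer, and the steered activation would be written back during the forward pass. How the layer and the value of `alpha` are chosen is not specified in the summary and is left open here.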
Cite this work:
@misc {
title={Steering Model’s Belief States},
author={Gonçalo Paulo and Sinem Erisken and Tassilo Neubauer},
date={6/3/24},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}