This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Computational Mechanics Hackathon!
June 3, 2024
Accepted at the Computational Mechanics Hackathon! research sprint on June 3, 2024

Steering Model’s Belief States

Recent results in Computational Mechanics show that Transformer models trained on sequences generated by Hidden Markov Models (HMMs) develop a belief state geometry in their residual streams. The models use this geometry to track the expected state of the HMM and thereby predict the next token in a sequence. In this project, we explored how steering can be used to induce a new belief state and hence alter the distribution of predicted tokens. We explored a traditional difference-in-means approach, using the activations of the models to define the belief states. We also used a lower-dimensional space that encodes the theoretical belief state geometry of a given HMM. We show that while both methods can steer the models’ behavior, the difference-in-means approach is more robust.
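The difference-in-means idea can be illustrated with a minimal sketch. It is not the authors' code: the activations below are synthetic NumPy arrays standing in for residual-stream activations collected while the model is in two different belief states, and the names `difference_in_means`, `steer`, `acts_a`, `acts_b`, and `scale` are hypothetical.

```python
import numpy as np


def difference_in_means(acts_source, acts_target):
    """Steering vector pointing from the source-state mean to the target-state mean.

    Both inputs have shape (n_samples, d_model).
    """
    return acts_target.mean(axis=0) - acts_source.mean(axis=0)


def steer(activation, steering_vector, scale=1.0):
    """Add the scaled steering vector to a single activation vector."""
    return activation + scale * steering_vector


# Synthetic stand-ins for activations (assumption: real activations would be
# read from the trained Transformer's residual stream).
rng = np.random.default_rng(0)
d_model = 16
center_a = rng.normal(size=d_model)  # hypothetical centre of belief state A
center_b = rng.normal(size=d_model)  # hypothetical centre of belief state B
acts_a = center_a + 0.1 * rng.normal(size=(200, d_model))
acts_b = center_b + 0.1 * rng.normal(size=(200, d_model))

v = difference_in_means(acts_a, acts_b)
steered = steer(acts_a[0], v)

# Compare how far the activation is from the state-B mean before and after steering.
dist_before = np.linalg.norm(acts_a[0] - acts_b.mean(axis=0))
dist_after = np.linalg.norm(steered - acts_b.mean(axis=0))
print(dist_before, dist_after)
```

In an actual experiment the steered activation would be written back into the model's forward pass and the effect measured on the next-token distribution; the distance comparison here only checks that the vector moves an activation toward the target cluster.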

By Gonçalo Paulo, Sinem Erisken, Tassilo Neubauer
