Nov 25, 2024
Unveiling Latent Beliefs Using Sparse Autoencoders
Carlos Cortez, Eivind Otto Hjelle, Sanchit Kalhan
Summary
Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features that encode the model's confidence or belief in the truth of its answers. Using the semantic and contrastive search tools in Goodfire AI's API, we uncover latent features associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and the reliability of AI systems.
Cite this work:
@misc{cortez2024unveiling,
  title={Unveiling Latent Beliefs Using Sparse Autoencoders},
  author={Cortez, Carlos and Hjelle, Eivind Otto and Kalhan, Sanchit},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}