This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Reprogramming AI Models Hackathon research sprint on November 25, 2024.
Unveiling Latent Beliefs Using Sparse Autoencoders

Language models (LMs) often generate outputs that are linguistically plausible yet factually incorrect, raising questions about their internal representations of truth and belief. This paper explores the use of sparse autoencoders (SAEs) to identify and manipulate features that encode a model's confidence or belief in the truth of its answers. Using GoodFire AI's API tools for semantic and contrastive search, we uncover latent features associated with correctness and accuracy in model responses. Experiments reveal that certain features can distinguish between true and false statements, while others serve as controls to validate our approach. By steering these belief-associated features, we demonstrate the ability to influence model behavior in a targeted manner, improving or degrading factual accuracy. These findings have implications for interpretability, model alignment, and the reliability of AI systems.
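The contrastive search and steering described above can be illustrated with a minimal, self-contained sketch. This is not the GoodFire AI API itself, whose interface is not reproduced here; all names, shapes, and values below are illustrative assumptions. The idea: given SAE feature activations on matched sets of true and false statements, belief-associated features are those whose mean activation differs most between the two sets, and steering shifts the residual stream along a chosen feature's decoder direction.

```python
# Hypothetical sketch of contrastive SAE feature search and steering.
# Stand-in data throughout; real activations would come from running the
# model with an SAE attached.
import numpy as np

rng = np.random.default_rng(0)
n_true, n_false, n_features, d_model = 128, 128, 4096, 1024

# SAE feature activations on true vs. false statements,
# shape: [n_statements, n_features] (random stand-ins here).
acts_true = rng.random((n_true, n_features))
acts_false = rng.random((n_false, n_features))

# Contrastive search: rank features by mean activation difference
# between the true and false sets.
diff = acts_true.mean(axis=0) - acts_false.mean(axis=0)
top_belief_features = np.argsort(-np.abs(diff))[:10]
print("Candidate belief features:", top_belief_features)

# Steering: nudge a residual-stream vector along a feature's
# decoder direction (SAE decoder weights are stand-ins here).
W_dec = rng.standard_normal((n_features, d_model))

def steer(residual: np.ndarray, feature_idx: int, scale: float) -> np.ndarray:
    """Add `scale` times the unit decoder direction of one feature."""
    direction = W_dec[feature_idx]
    return residual + scale * direction / np.linalg.norm(direction)

steered = steer(rng.standard_normal(d_model), int(top_belief_features[0]), 4.0)
```

A positive `scale` amplifies the putative belief feature (in the paper's framing, pushing the model toward confident, accurate answers), while a negative value suppresses it; the control features mentioned in the abstract would be steered the same way to check that accuracy changes are specific to belief-associated features.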

By Carlos Cortez, Eivind Otto Hjelle, Sanchit Kalhan