This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Women in AI Safety Hackathon
March 10, 2025
Accepted at the Women in AI Safety Hackathon research sprint on Debugging Language Models with SAEs.

This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its inconsistent ability to count letters depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries activate distinct internal pathways. Sparse Autoencoder (SAE) analysis of feature activation patterns reveals that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related features. Feature steering experiments show that simply amplifying the counting features does not produce correct behavior. Further analysis identifies tokenization as another important factor: different ways of breaking very similar sentences into tokens influence the model's response. Grammatical structure also plays a role, with "is" phrasing yielding better results than "are".
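
To illustrate the tokenization point, here is a minimal sketch (not the report's actual code) that compares how the Llama-3.1 tokenizer splits the uppercase and lowercase prompts. The checkpoint name is assumed, access to the gated Hugging Face model is required, and the exact token splits it prints may differ from those discussed in the report.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name; requires access to the gated meta-llama repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Compare token boundaries for the uppercase and lowercase counting questions.
for prompt in ["How many Rs are in BERRY?", "How many rs are in berry?"]:
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    pieces = [tokenizer.decode([i]) for i in ids]
    print(f"{prompt!r} -> {pieces}")
```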

By Wen Xing
