Mar 10, 2025
Debugging Language Models with SAEs
Wen Xing
This report investigates an intriguing failure mode in the Llama-3.1-8B-Instruct model: its ability to count letters is inconsistent, depending on letter case and grammatical structure. While the model correctly answers "How many Rs are in BERRY?", it struggles with "How many rs are in berry?", suggesting that uppercase and lowercase queries engage entirely different internal pathways.
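The behavioural gap is straightforward to probe. Below is a minimal sketch for reproducing the comparison, assuming access to the published Hugging Face checkpoint; the generation settings are illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

for prompt in ["How many Rs are in BERRY?", "How many rs are in berry?"]:
    # Wrap the question in the instruct chat template before generating.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=20, do_sample=False)
    answer = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    print(prompt, "->", answer)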
Through Sparse Autoencoder (SAE) analysis, feature activation patterns reveal that uppercase queries trigger letter-counting features, while lowercase queries instead activate uncertainty-related features. Feature steering experiments show that simply amplifying the letter-counting features does not produce correct behavior.
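Feature steering of this kind typically adds a scaled copy of a feature's decoder direction to the residual stream during the forward pass. The sketch below illustrates that mechanism with a PyTorch forward hook; the layer index, feature index, scale, and the SAE's W_dec attribute (a SAELens-style convention) are assumptions for illustration, not details taken from the report:

import torch

# Assumes `model` is the loaded Llama-3.1-8B-Instruct and `sae` is a sparse
# autoencoder trained on the residual stream at `layer`; the indices below
# are placeholders.
layer, feature_idx, scale = 16, 1234, 8.0

def amplify_feature(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # W_dec[feature_idx] is the feature's direction in residual space;
    # adding a scaled copy amplifies it regardless of the prompt.
    direction = sae.W_dec[feature_idx].to(hidden.dtype).to(hidden.device)
    hidden = hidden + scale * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer].register_forward_hook(amplify_feature)
# ... run generation as above with the hook active ...
handle.remove()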
Further analysis identifies tokenization as another important factor: near-identical prompts can be split into different token sequences, and these splits influence the model's response. Grammatical structure also plays a role, with "is" phrasing yielding better results than "are" phrasing.
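The splits themselves are easy to inspect with the model's tokenizer; the prompt variants below are illustrative rather than the exact ones used in the report:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
for prompt in ["How many Rs are in BERRY?",
               "How many rs are in berry?",
               "How many rs is in berry?"]:
    # Near-identical strings can tokenize very differently.
    print(prompt, "->", tokenizer.tokenize(prompt))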
Zainab Majid
Interesting research direction - well done! This paper presents an innovative approach to an interesting issue: how capitalisation and grammar in prompts can impact results. I would suggest carrying out a thorough literature review to understand the problem space in detail. I'd also think carefully about AI safety risks and what tokenization inconsistencies could mean for safety. Engaging with the literature and with the potential impacts on the safety space would make this work stronger.
Cite this work
@misc{xing2025debugging,
  title={Debugging Language Models with SAEs},
  author={Wen Xing},
  date={2025-03-10},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}