Mar 10, 2025
Mechanistic Interpretability Track: Neuronal Pathway Coverage
Garance Colomer, Luc Chan, Sacha Lahlou, Leina Corporan Miath
Our study explores mechanistic interpretability by analyzing how Llama 3.3 70B classifies political content. We first infer users' political alignment (Biden, Trump, or Neutral) from their tweets, profile descriptions, and locations. We then extract the most strongly activated features from the Biden- and Trump-aligned datasets and rank them by stability and relevance. Using these features, we reclassify users by prompting the model to rely only on them. Finally, we compare the new classifications with the initial ones, assessing neural pathway overlap and classification consistency through accuracy metrics and visualization of activation patterns.
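To make the final comparison step concrete, below is a minimal sketch (not the authors' code) of how the classification-consistency check could be computed once both passes of labels are available. The label set, helper names, and toy data are assumptions for illustration; the first-pass and feature-only labels would come from the prompting steps described above.

```python
# Minimal sketch: compare first-pass labels with feature-only reclassifications.
# All names and example data here are illustrative assumptions, not the authors' code.
from collections import Counter

LABELS = ("Biden", "Trump", "Neutral")

def agreement_rate(first_pass: list[str], second_pass: list[str]) -> float:
    """Fraction of users whose feature-only label matches their original label."""
    assert len(first_pass) == len(second_pass) and first_pass
    return sum(a == b for a, b in zip(first_pass, second_pass)) / len(first_pass)

def confusion_counts(first_pass: list[str], second_pass: list[str]) -> dict:
    """How each original label was reclassified, for inspecting where the two passes diverge."""
    pair_counts = Counter(zip(first_pass, second_pass))
    return {orig: {new: pair_counts[(orig, new)] for new in LABELS} for orig in LABELS}

if __name__ == "__main__":
    # Toy labels standing in for model outputs on a handful of users (illustrative only).
    original = ["Biden", "Trump", "Neutral", "Trump", "Biden"]
    feature_only = ["Biden", "Trump", "Biden", "Trump", "Neutral"]
    print(f"agreement: {agreement_rate(original, feature_only):.2f}")
    print(confusion_counts(original, feature_only))
```

A per-label breakdown like this complements the overall accuracy number: it shows whether disagreements are concentrated in one alignment class or spread across all three.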
Liv Gorton
Studying how different groups of people (e.g. different political affiliations) are represented in LLMs is a very interesting question! A more detailed methods section (e.g. showing the prompts fed to the model at each step, plus a figure with an example tweet, its feature activations, and the resulting classification) would help make some of the specifics easier to follow.
Cite this work
@misc{colomer2025neuronalpathwaycoverage,
  title={Mechanistic Interpretability Track: Neuronal Pathway Coverage},
  author={Garance Colomer and Luc Chan and Sacha Lahlou and Leina Corporan Miath},
  date={2025-03-10},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}