Mar 10, 2025

Mechanistic Interpretability Track: Neuronal Pathway Coverage

Garance Colomer, Luc Chan, Sacha Lahlou, Leina Corporan Miath


Summary

Our study explores mechanistic interpretability by analyzing how Llama 3.3 70B classifies political content. We first infer user political alignment (Biden, Trump, or Neutral) based on tweets, descriptions, and locations. Then, we extract the most activated features from Biden- and Trump-aligned datasets, ranking them based on stability and relevance. Using these features, we reclassify users by prompting the model to rely only on them. Finally, we compare the new classifications with the initial ones, assessing neural pathway overlap and classification consistency through accuracy metrics and visualization of activation patterns.
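The final comparison step described above can be sketched in a few lines. This is a minimal illustration, not the study's actual evaluation code: the labels and feature IDs below are hypothetical, and the two helper functions simply compute classification agreement and Jaccard overlap between the top-feature sets.

```python
def classification_agreement(original, feature_based):
    """Fraction of users whose feature-restricted label matches the original label."""
    assert len(original) == len(feature_based)
    matches = sum(a == b for a, b in zip(original, feature_based))
    return matches / len(original)


def feature_overlap(features_a, features_b):
    """Jaccard overlap between two sets of top-activated feature IDs."""
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b)


# Hypothetical labels and feature IDs for illustration only
original = ["Biden", "Trump", "Neutral", "Biden", "Trump"]
feature_based = ["Biden", "Trump", "Biden", "Biden", "Trump"]

print(classification_agreement(original, feature_based))  # 0.8
print(feature_overlap({101, 204, 377}, {204, 377, 512}))  # 0.5
```

A high agreement score would indicate that the extracted features carry most of the signal the model uses for its original classification; the overlap score quantifies how much the Biden- and Trump-aligned pathways share.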

Cite this work:

@misc{
  title={Mechanistic Interpretability Track: Neuronal Pathway Coverage},
  author={Garance Colomer and Luc Chan and Sacha Lahlou and Leina Corporan Miath},
  date={2025-03-10},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.