Nov 25, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

Daniel Davies, Ashwarya Maratha

Details

We introduce a framework for explaining latents in the Turing-LLM-1.0-254M model using a predefined set of function types, which yields a more human-readable “source code” of the model’s internal mechanisms. Categorising latents with multiple function types moves towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. To evaluate the explanations, we generate unseen sequences with variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI and test the activation patterns of latents on them, which validates the accuracy and reliability of the explanations. Our methods explained up to 95% of a random subset of latents in layers, and the results suggest that the explanations discovered are meaningful.
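
The evaluation idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the `token_match` function type, the `Explanation` class, the `score` helper, and the activation threshold are all hypothetical names and choices made for illustration. The sketch treats an explanation as a predicate over token positions and compares its predictions with a latent's recorded activations on a held-out sequence.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Explanation:
    """A latent explanation drawn from a predefined function type (hypothetical)."""
    function_type: str
    predicts: Callable[[Sequence[str], int], bool]


def token_match(vocab: Iterable[str]) -> Explanation:
    """Hypothetical function type: the latent fires on any token in `vocab`."""
    allowed = frozenset(vocab)
    return Explanation("token_match", lambda tokens, i: tokens[i] in allowed)


def score(
    explanation: Explanation,
    tokens: Sequence[str],
    activations: Sequence[float],
    threshold: float = 0.0,
) -> Tuple[float, float]:
    """Return (precision, recall) of the explanation on one unseen sequence."""
    predicted = [explanation.predicts(tokens, i) for i in range(len(tokens))]
    actual = [a > threshold for a in activations]  # assumed firing criterion
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    precision = true_pos / sum(predicted) if any(predicted) else 0.0
    recall = true_pos / sum(actual) if any(actual) else 0.0
    return precision, recall


# Toy held-out sequence with made-up activations for a single latent.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
activations = [0.9, 0.0, 0.0, 0.0, 0.7, 0.2]
precision, recall = score(token_match({"the"}), tokens, activations)
```

In this toy case the explanation never predicts a firing that does not occur, but it misses the weak activation on "mat", so recall is lower than precision. A real evaluation would aggregate such scores over many generated sequences.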

Cite this work:

@misc{
  title={Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types},
  author={Daniel Davies, Ashwarya Maratha},
  date={11/25/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Reviewer's Comments

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.