Nov 25, 2024
Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types
Daniel Davies, Ashwarya Maratha
Summary
We introduce a framework for explaining latents in the Turing-LLM-1.0-254M model using a predefined set of function types, yielding a more human-readable “source code” of the model’s internal mechanisms. By categorising latents with multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. To evaluate the explanations, we generate unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI and test whether each latent's activation pattern on those sequences matches its explanation, which checks the accuracy and reliability of the explanations. Our methods explained up to 95% of a random subset of latents in some layers, and the results suggest that the explanations found are meaningful.
Cite this work:
@misc{davies2024explaining,
  title={Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types},
  author={Daniel Davies and Ashwarya Maratha},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}