This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Reprogramming AI Models Hackathon
November 25, 2024
Accepted at the Reprogramming AI Models Hackathon research sprint on November 25, 2024

Explaining Latents in Turing-LLM-1.0-254M with Pre-Defined Function Types

We introduce a novel framework for explaining latents in the Turing-LLM-1.0-254M model based on a predefined set of function types, allowing for a more human-readable “source code” of the model’s internal mechanisms. By categorising latents using multiple function types, we move towards mechanistic interpretability that does not rely on potentially unreliable explanations generated by other language models. Evaluation strategies include generating unseen sequences using variants of Meta-Llama-3-8B-Instruct provided by GoodFire AI to test the activation patterns of latents, thereby validating the accuracy and reliability of the explanations. Our methods successfully explained up to 95% of a random subset of latents in layers, with results suggesting meaningful explanations were discovered.
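
The evaluation idea described above, checking an explanation against a latent's activations on unseen text, can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' implementation: the `TokenSetDetector` function type, the `explanation_scores` helper, the zero activation threshold, and the toy activations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TokenSetDetector:
    """Hypothetical function type: the latent fires when the token is in a fixed set."""
    tokens: frozenset

    def predict(self, sequence: Sequence[str]) -> list:
        # One boolean per position: does the explanation expect the latent to fire?
        return [tok in self.tokens for tok in sequence]


def explanation_scores(predicted, observed, threshold=0.0):
    """Precision and recall of predicted firing positions against observed activations."""
    fired = [a > threshold for a in observed]
    tp = sum(p and f for p, f in zip(predicted, fired))
    fp = sum(p and not f for p, f in zip(predicted, fired))
    fn = sum((not p) and f for p, f in zip(predicted, fired))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy held-out sequence with made-up activations for a single latent.
sequence = ["the", "cat", "sat", "on", "the", "mat"]
activations = [0.0, 2.1, 0.0, 0.0, 0.0, 1.7]

detector = TokenSetDetector(frozenset({"cat", "sat", "mat"}))
precision, recall = explanation_scores(detector.predict(sequence), activations)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the explanation covers every position where the latent fires but also predicts one position where it does not, so recall is high while precision is lower. A real evaluation would replace the toy data with model activations on freshly generated sequences.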

By Daniel Davies, Ashwarya Maratha
