Nov 11 - Nov 13, 2022
The Interpretability Hackathon
48 hours of intense research in interpretability, the modern neuroscience of AI.
This event has concluded.
Hosted by Esben Kran, Apart Research, Neel Nanda
Join this AI safety hackathon to find new perspectives on the "brains" of AI!

We will work with multiple research directions, and one of the most relevant is mechanistic interpretability. Here, we work towards a researcher's understanding of neural networks and why they do what they do.

We provide you with the best starter templates that you can work from so you can focus on creating interesting research instead of browsing Stack Overflow. You're also very welcome to check out some of the ideas already posted!
How to participate
Create a user on this website (itch.io) and click Participate. We will then assume that you are going to take part, so please cancel your registration if you won't be joining the hackathon.
Instructions
You will work on research ideas you generate during the hackathon and you can find more inspiration below.
Submission
Everyone helps rate the submissions against a shared set of criteria that we ask all participants to follow. You will be assigned 5 projects to rate during the 4 hours of judging before you can rate projects of your own choosing (so we avoid selectivity in the judging).
Inspiration
We have many ideas available for inspiration on the aisi.ai Interpretability Hackathon ideas list. A lot of interpretability research is available on distill.pub, Transformer Circuits, and Anthropic's research page.
Introductions to mechanistic interpretability
Catherine Olsson's advice on getting started with mechanistic interpretability research
A video walkthrough of A Mathematical Framework for Transformer Circuits.
Jacob Hilton's deep learning curriculum week on interpretability
An annotated list of good interpretability papers, along with summaries and takes on what to focus on.
Neel's barebones prerequisites for mechanistic interpretability research
See also these tools for interpretability:
Redwood Research's interpretability tools: http://interp-tools.redwoodresearch.org/
The activation atlas: https://distill.pub/2019/activation-atlas/
The Tensorflow playground: https://playground.tensorflow.org/
The Neural Network Playground (train simple neural networks in the browser): https://nnplayground.com/
Visualize different neural network architectures: http://alexlenail.me/NN-SVG/index.html
Digestible research
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers summarizes a long list of papers that will be useful for your interpretability projects.
Andrej Karpathy's "Understanding what convnets learn"
12 toy language models designed to be easier to interpret, in the style of A Mathematical Framework for Transformer Circuits: 1-, 2-, 3-, and 4-layer models; for each size there is one attention-only model, one with GeLU activations, and one with SoLU activations (an activation function designed to make the model's neurons more interpretable - https://transformer-circuits.pub/2022/solu/index.html). These aren't well documented yet, but are available in EasyTransformer; see the loading sketch after this list.
Anthropic Twitter thread going through some language model results
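If you want to poke at those toy models during the hackathon, below is a minimal sketch (not from the event materials) of loading one with EasyTransformer and running a forward pass. The model name "attn-only-2l" and the exact calls are assumptions based on the library's usual API, so check the EasyTransformer documentation for the current names.

# Minimal sketch, assuming the EasyTransformer package and the toy-model name
# "attn-only-2l" (a 2-layer, attention-only model); verify against the library docs.
from easy_transformer import EasyTransformer

# Load a pretrained toy model intended for interpretability work
model = EasyTransformer.from_pretrained("attn-only-2l")

# Tokenize a prompt and run a forward pass to get next-token logits
tokens = model.to_tokens("Interpretability is the neuroscience of AI")
logits = model(tokens)  # shape: [batch, position, d_vocab]
print(logits.shape)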
Below is a talk by Esben on a principled introduction to interpretability for safety: