Sep 29, 2022
-
Oct 1, 2022
Language Model Hackathon
Alignment Jam #1
This event has concluded.
Join this AI safety hackathon to compete in uncovering novel aspects of how language models work! This follows the "black box interpretability" agenda of Buck Shlegeris:
Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.

Inspiration

See a full list of inspiration, code, and data for the weekend here.

- AI Safety Ideas project ideas
- "Language Models are Few-Shot Learners" (NeurIPS)
- TruthfulQA - Stephanie Lin, Jacob Hilton, Owain Evans (arXiv)
- Chain-of-thought prompting (arXiv)
- Apart Research's cognitive bias testing for the Inverse Scaling Prize - Esben Kran, Jonathan Rystrøm
- Epistemic biases in LLMs - Siméon Campos
- The Inverse Scaling Google Colab - see the left-side file view to experiment with datasets (Inverse Scaling Prize)
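To make the black-box approach concrete, here is a minimal sketch of a behavioral probe in the spirit of the chain-of-thought work listed above: ask a model the same question with and without a "think step by step" cue and compare the answers. `query_model` is a hypothetical stand-in for a real LM API call (here a toy stub so the harness runs end-to-end); swap in your own client for real experiments.

```python
def query_model(prompt: str) -> str:
    """Hypothetical LM call; replace with a real API client.

    This toy stub "answers" correctly only when prompted to reason
    step by step, mimicking a chain-of-thought effect.
    """
    if "step by step" in prompt:
        return "17 + 25 = 42. Answer: 42"
    return "Answer: 40"


def probe(question: str, expected: str) -> dict:
    """Run the same question in plain and chain-of-thought form."""
    plain = query_model(f"Q: {question}\nA:")
    cot = query_model(f"Q: {question}\nLet's think step by step.\nA:")
    return {
        "plain_correct": expected in plain,
        "cot_correct": expected in cot,
    }


result = probe("What is 17 + 25?", "42")
print(result)
```

Running many such paired prompts over a dataset (e.g. the Inverse Scaling Colab datasets) lets you characterize model behavior purely from its outputs, with no access to weights or activations.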
The schedule allows for 46 hours of research jamming. You can decide your level of commitment with your teammates during the jam, but we encourage you to remember to sleep and eat.
Entries
Check back later to see entries for this event.
Our Other Sprints
Apr 25, 2025
-
Apr 27, 2025
Economics of Transformative AI: Research Sprint
This unique event brings together diverse perspectives to tackle crucial challenges in AI alignment, governance, and safety. Work alongside leading experts, develop innovative solutions, and help shape the future of responsible AI.
Apr 25, 2025
-
Apr 26, 2025
Berkeley AI Policy Hackathon
This unique event brings together diverse perspectives to tackle crucial challenges in AI alignment, governance, and safety. Work alongside leading experts, develop innovative solutions, and help shape the future of responsible AI.