We teamed up with Goodfire AI for our latest Hackathon, a deep dive into mechanistic interpretability and feature manipulation. We received a wealth of submissions and were very impressed by their quality. Neel Nanda, who leads Google DeepMind's mechanistic interpretability team, gave the keynote talk.
As AI models become more powerful, understanding their internal mechanisms is crucial for building reliable, controllable AI systems. The winning Sprint submissions all tackled exactly this challenge, and we present them below.
01. AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control
This submission combined reinforcement learning with interpretable neural features, improving model behavior roughly threefold while keeping the model's core knowledge intact.
02. Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities
This team built an AI 'immune system' that catches malicious prompts with 100% accuracy and distinguishes harmful from harmless content with 83% precision (see the classifier sketch after this list).
03. Utilitarian Decision-Making in Models - Evaluation and Steering
This project mapped how AI systems make moral decisions, revealing how models weigh ethical choices differently from humans.
04. Steering Swiftly to Safety with Sparse Autoencoders
This team developed a method to selectively remove unwanted knowledge from AI systems while preserving their general capabilities (see the steering sketch after this list).
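To give a flavor of the approach behind project 02, here is a minimal sketch, not the team's code, of training a simple classifier on the latent feature activations a prompt elicits. The activation vectors below are random placeholders; in practice they would come from recording sparse-autoencoder feature activations while a model processes benign and adversarial prompts.

```python
# Hypothetical sketch: classify prompts by the SAE feature activations they trigger.
# The data here is synthetic placeholder data, not results from the hackathon project.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_features = 512  # assumed number of SAE features being tracked

# Placeholder activations: adversarial prompts fire a handful of features harder.
benign = rng.exponential(scale=0.5, size=(200, n_features))
adversarial = benign.copy()
adversarial[:, :20] += rng.exponential(scale=2.0, size=(200, 20))

X = np.vstack([benign, adversarial])
y = np.array([0] * 200 + [1] * 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")

# The largest coefficients point to the features most predictive of adversarial intent.
top = np.argsort(np.abs(clf.coef_[0]))[-5:]
print("most predictive feature indices:", top)
```

The appeal of this setup is that the classifier's weights are themselves interpretable: each coefficient belongs to a named feature, so a flagged prompt can be explained in terms of which features fired.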
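Projects 01 and 04 both rely on steering a model by intervening on its activations rather than its weights. Below is a minimal, hypothetical sketch of that idea: a forward hook that shifts one layer's output along a stand-in SAE feature direction at inference time. `TinyModel`, `feature_direction`, and `strength` are illustrative assumptions, not the teams' implementations.

```python
# Hypothetical sketch of weight-preserving activation steering with PyTorch hooks.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in for a real transformer block."""
    def __init__(self, d_model=64):
        super().__init__()
        self.layer1 = nn.Linear(d_model, d_model)
        self.layer2 = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

torch.manual_seed(0)
model = TinyModel()

# Stand-in for an SAE decoder direction corresponding to an unwanted feature.
feature_direction = torch.randn(64)
feature_direction = feature_direction / feature_direction.norm()
strength = -4.0  # negative strength suppresses the feature

def steering_hook(module, inputs, output):
    # Shift the layer's output along the feature direction; the weights stay untouched.
    return output + strength * feature_direction

handle = model.layer1.register_forward_hook(steering_hook)
x = torch.randn(2, 64)
steered = model(x)
handle.remove()
unsteered = model(x)
print("mean activation shift from steering:", (steered - unsteered).abs().mean().item())
```

Because the hook only edits activations on the fly, removing it restores the original behavior, which is the sense in which this kind of control is "weight-preserving."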
Read our X thread here for a rundown of all the projects! Remember: if an idea is promising, participants may be invited to join our brand-new Apart Lab Studio to continue developing their research project. Sixteen participants were invited from this Hackathon alone.
See the full list of pilot idea submissions here, and keep up to date here so you never miss another sprint.