Behind the Features: Goodfire's Interpretability Tools in Action

Goodfire's Myra reveals how their groundbreaking API enabled 200+ researchers at Apart's global hackathon to advance AI interpretability, demonstrating new ways to make AI systems more transparent and controllable.

January 17, 2025

In November 2024, over 200 researchers and developers came together for the Reprogramming AI Models Hackathon, a groundbreaking event that put Goodfire's interpretability tools to the test. The results were nothing short of remarkable - from teaching AI models to play games through neural feature manipulation to building robust defenses against adversarial attacks.

But before we dive into the technical innovations, let's understand why this matters. As AI systems become more powerful and widespread, the ability to understand and control their internal workings isn't just an academic exercise - it's crucial for building safe and reliable AI. This is where Goodfire's mission comes in.

The hackathon showcased remarkable applications of Goodfire's tools - from detecting adversarial attacks to teaching AI systems to "unlearn" harmful capabilities while preserving beneficial ones. Teams leveraged Goodfire's Ember API to peek inside AI models and manipulate their behavior in controlled, interpretable ways.

What made this possible? I sat down with Myra, the founding PM at Goodfire, to discuss their tools, the hackathon's impact, and the future of interpretability research.

Archana: Hi Myra! The field of mechanistic interpretability has historically been limited to researchers with significant computational resources and specialized knowledge. How is Goodfire changing this landscape?

"We are a research lab advancing the field of AI interpretability. Our mission is to solve the technical alignment problem for powerful AI systems. We believe that major advances in mechanistic interpretability - understanding how AI systems work internally - are key to solving this challenge. At the core of our work is creating accessible tools that accelerate interpretability research."

Archana: What inspired the development of your feature steering API?

"Our work builds on recent research developments showing that human-interpretable features can be extracted from and edited in LLMs. Model editing and steering represent one of the most promising applications of current-day interpretability research."

"Through our now publicly released hosted API, Ember, we've made it possible to conduct large-scale research in an accessible way. We designed Ember to be easy to set up and use, so that anyone interested in exploring interpretability research can get started."

Archana: Your API achieved something remarkable during the hackathon - enabling teams to build everything from adversarial detectors to moral reasoning analyzers. The hackathon projects achieved impressive results using your SAE-derived features. Could you explain how these features are extracted and what makes them particularly useful for steering?

"We trained state-of-the-art interpreter models (specifically sparse autoencoders or SAEs) for Llama 8B and 70B on the LMSYS-Chat-1M chat dataset and generated high-quality, human-readable labels for features using an automated interpretability pipeline. We evaluated a number of SAEs trained with different hyperparameters and ran internal benchmarking to measure relative steering performance. We then chose the best SAE and made it accessible via the API."

"You can read a more detailed report on our approach in our research post “Understanding and Steering Llama 3)."

Archana: We saw teams using your API for everything from adversarial detection to model unlearning. What technical capabilities enabled this versatility?

"We designed a simple interface that focuses on Sparse Autoencoder (SAE) features as the primary building block. Through our Features API, users can modify existing models like Llama 3.1 8B by adjusting feature weights. After extensive experimentation and benchmarking, we discovered the most effective methods for feature modification and made these insights available through our API."

Archana: How does your approach to feature identification differ from other interpretability tools?

"We defined three ways to find features you may want to modify. Auto Steer: Simply describe what you want, and let the API automatically select and adjust feature weights. Feature Search: Find features using semantic search and Contrastive Search: Identify relevant features by comparing two different datasets."

The hackathon revealed an emerging pattern in how researchers approach interpretability problems. With nearly 250 participants from 15+ countries, we saw diverse approaches to using Goodfire's tools.

Archana: The winning projects demonstrated novel applications of your tools. Were there any implementations that surprised you or opened new possibilities?

"We’ve consolidated some of the projects we were really impressed with in this twitter thread."

"We found it really valuable to have a group of around 200 participants hacking on our API over a short period of time. We got a chance to stress test our infrastructure’s ability to handle concurrent load, and also made a number of modifications to our API and documentation in response to participants’ feedback (e.g.,, making it really easy to look-up features by their indices)."

What makes this interview particularly compelling is the context: Goodfire's tools weren't just tested in isolated experiments, but battle-tested by hundreds of researchers simultaneously. The hackathon served as both a proof-of-concept for their API's robustness and a fountain of novel applications.

Archana: After seeing so many participants work with your tools, what patterns emerged in how researchers approach interpretability problems? How has the feedback from hackathon participants influenced your roadmap for future features?

"We've seen huge interest in feature steering capabilities from everyone - both hobby API users and experienced interpretability researchers. We're particularly excited about refining how we find effective feature interventions for steering. Looking ahead, we want to support the latest open-source models. We believe that to truly understand AI, we need access to the internal workings of models that show emergent behaviors - things like reasoning, complex coding, and other increasingly advanced capabilities."

"As we move forward, the insights gained from this hackathon and the growing community of interpretability researchers may well prove crucial in shaping the future of AI development. Through events like this and tools like Ember, we're not just observing AI's evolution - we're actively participating in making it more transparent, controllable, and aligned with human values."

Participants from our Jam Site in EPFL

Archana: What advice would you give to researchers looking to get started with your interpretability tools?

"Check out our docs for a quickstart guide, and read up on some background with resources like."

"Three areas we’re currently excited about include: improving feature steering performance with auto-steering, using feature steering to make models more robust to jailbreaks, and developing more interpretable classifier models built on features."

Archana: What role do you see hackathons and community events playing in advancing interpretability research?

"We're really excited to keep hosting hackathons with the interpretability community! These events have proven to be incredibly valuable - they're great for meeting and working with peers, getting interpretability newcomers involved in research, and sparking fresh ideas."

The hackathon served as a stress test for Goodfire's API, with hundreds of researchers simultaneously pushing the boundaries of what's possible with interpretability tools. Projects ranged from novel security applications to groundbreaking control mechanisms, demonstrating the versatility and robustness of the platform.