
Reprogramming AI Models Hackathon

November 22, 2024, 4:00 PM to November 25, 2024, 3:00 AM (UTC)
This event is finished. It occurred between November 22, 2024 and November 25, 2024.

Join us for a groundbreaking exploration of AI model internals, where we'll dive deep into mechanistic interpretability and feature manipulation. In partnership with Goodfire, we're bringing you unprecedented access to state-of-the-art tools for understanding and steering AI behavior.

Whether you're an AI researcher, a curious developer, or passionate about making AI systems more transparent and controllable, this hackathon is for you. As a participant, you will:

  • Collaborate with experts to create novel AI observability tools
  • Learn about mechanistic interpretability from industry leaders
  • Contribute to solving real-world challenges in AI safety and reliability
  • Compete for prizes and the opportunity to influence the future of AI development

Register now and be part of the movement towards more transparent, reliable, and beneficial AI systems. We provide access to Goodfire's SDK/API and research preview playground, enabling participation regardless of prior experience with AI observability.

Why This Matters

As AI models become more powerful and widespread, understanding their internal mechanisms isn't just academic curiosity—it's crucial for building reliable, controllable AI systems. Mechanistic interpretability gives us the tools to peek inside these "black boxes" and understand how they actually work, neuron by neuron and feature by feature.

What You'll Get

  • Exclusive Access: Use Goodfire's API to access an interpretable 8B or 70B model with efficient inference.
  • Cutting-Edge Tools: Experience Goodfire's SDK/API for feature steering and manipulation
  • Advanced Capabilities: Work with conditional feature interventions and sophisticated development flows
  • Free Resources: Compute credits for every team to ensure you can pursue ambitious projects
  • Expert Guidance: Direct mentorship from industry leaders throughout the weekend

Project Tracks

1. Feature Investigation

  • Map and analyze feature phenomenology in large language models
  • Discover and validate useful feature interventions
  • Research the relationship between feature weights and intervention success
  • Develop metrics for intervention quality assessment

2. Tooling Development

  • Build tools for automated feature discovery
  • Create testing frameworks for intervention reliability
  • Develop integration tools for existing ML frameworks
  • Improve auto-interpretation techniques

3. Visualization & Interface

  • Design intuitive visualizations for feature maps
  • Create interactive tools for exploring model internals
  • Build dashboards for monitoring intervention effects
  • Develop user interfaces for feature manipulation

4. Novel Research

  • Investigate improvements to auto-interpretation
  • Study feature interaction patterns
  • Research intervention transfer between models
  • Explore new approaches to model steering

Why Goodfire's Tools?

While participants are welcome to use their existing setups, Goodfire's API is the recommended primary option and brings exceptional value to this hackathon.

Goodfire provides:

  • Access to a 70B parameter model via API (with efficient inference)
  • Feature steering capabilities made simple through the SDK/API
  • Advanced development workflows including conditional feature interventions

The hackathon serves as a unique opportunity for Goodfire to gather valuable feedback from the developer community on their API/SDK. To ensure all participants can pursue ambitious research projects without constraints, Goodfire is providing free compute credits to every team.

Previous Participant Experiences

"I learned so much about AI Safety and Computational Mechanics. It is a field I have never heard of, and it combines two of my interests - AI and Physics. Through the hackathons, I gained valuable connections and learned a lot from researchers with extensive experience." - Doroteya Stoyanova, Computer Vision Intern

Speakers & Collaborators

Tom McGrath

Chief Scientist at Goodfire, previously Senior Research Scientist at Google DeepMind, where he co-founded the interpretability team
Organizer and Judge

Neel Nanda

Team lead of the mechanistic interpretability team at Google DeepMind and a prolific advocate for open-source interpretability research.
Speaker & Judge

Dan Balsam

CTO at Goodfire, previously Founding Engineer and Head of AI at RippleMatch. Goodfire builds interpretability products for safe and reliable generative AI models.
Organizer and Mentor

Myra Deng

Founding PM at Goodfire, Stanford MBA and MS CS graduate previously building modeling platforms at Two Sigma
Organizer

Archana Vaidheeswaran

Archana is responsible for organizing the Apart Sprints, research hackathons to solve the most important questions in AI safety.
Organizer

Jaime Raldua

Jaime has 8+ years of experience in the tech industry. He started his own data consultancy to support EA organizations and currently works at Apart Research as a Research Engineer.
Organizer and Judge

Joseph Bloom

Joseph co-founded Decode Research, a non-profit organization aiming to accelerate progress in AI safety research infrastructure, and is a mechanistic interpretability researcher.
Speaker

Alana X

Member of Technical Staff at Magic, leading initiatives on model evaluations. Previously a Research Intern at METR
Judge

Callum McDougall

ARENA Director and SERI-MATS alumnus specializing in mechanistic interpretability and AI alignment education
Speaker

Mateusz Dziemian

Member of Technical Staff at Gray Swan AI who recently worked on a UK AISI collaboration. Previous participant in an Apart Sprint that led to a NeurIPS workshop paper.
Judge

Simon Lermen

MATS Scholar under Jeffrey Ladish and an independent researcher. Mentored a SPAR project on AI agents and has researched AI-agent-driven spear-phishing.
Judge

Liv Gorton

Founding Research Scientist at Goodfire AI. Undertook independent research on sparse autoencoders in InceptionV1
Judge

Esben Kran

Esben is the co-director of Apart Research and specializes in organizing research teams on pivotal AI security questions.
Organizer

Jason Schreiber

Jason is co-director of Apart Research and leads Apart Lab, our remote-first AI safety research fellowship.
Organizer

To ensure you're well-equipped for the Reprogramming AI Models Hackathon, we've compiled a set of resources to support your participation:

  1. Goodfire's SDK/API with hosted inference: Your primary toolkit for the hackathon. Familiarize yourself with our framework for understanding and modifying AI model behavior.
    • Hosted inference on Llama 3 8B and 70B models
    • Feature inspection and intervention capabilities: https://docs.goodfire.ai/examples/advanced.html#Feature-intervention-modes
    • Example notebooks and tutorials: https://docs.goodfire.ai/examples/quickstart.html#Use-contrastive-features-to-fine-tune-with-a-single-example!
    • Latent explorer visualization tools: https://docs.goodfire.ai/examples/latent_explorer.html
    • Rate limits and usage guidelines: https://docs.goodfire.ai/rate-limits.html
  2. Research Preview Playground
    • Sandbox environment for model experimentation: https://docs.goodfire.ai/examples/quickstart.html#Replace-model-calls-with-OpenAI-compatible-API
    • Feature activation analysis tools: https://docs.goodfire.ai/examples/advanced.html#Feature-intervention-modes
    • Conditional intervention testing: https://docs.goodfire.ai/examples/advanced.html#Conditional-feature-interventions
  3. Check out the Jupyter Notebook Quickstart (a minimal, hedged usage sketch follows this resource list). In this quickstart, you'll learn how to:
    • Sample from a language model (in this case, Llama 3 8B)
    • Search for exciting features and intervene in them to steer the model
    • Find features by contrastive search
    • Save and load Llama models with steering applied
  4. Tutorial: Visualizing AI Model Internals: Watch this video to understand how to use Goodfire's tools to map and visualize AI model behavior.
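
To tie the quickstart steps above together, here is a minimal sketch of the end-to-end flow (create a model variant, search for a feature, apply a steering weight, sample). The class and method names (`goodfire.Client`, `goodfire.Variant`, `client.features.search`, `variant.set`) mirror the pattern in the linked quickstart, but treat them as assumptions and defer to the documentation above for the current signatures.

```python
# Minimal sketch of the SDK flow described in the quickstart above.
# Names and signatures are assumptions based on the linked docs;
# verify against docs.goodfire.ai before relying on them.
import goodfire

client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")  # placeholder key

# A variant wraps a base model plus any feature interventions you apply to it.
variant = goodfire.Variant("meta-llama/Meta-Llama-3-8B-Instruct")

# 1. Search for features related to a concept of interest.
features = client.features.search("pirate speech", model=variant, top_k=5)

# 2. Steer: emphasize the top feature with a modest positive weight.
variant.set(features[0], 0.5)

# 3. Sample from the steered model through the chat interface.
response = client.chat.completions.create(
    model=variant,
    messages=[{"role": "user", "content": "Tell me about your weekend."}],
)
print(response)  # OpenAI-style chat completion object
```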

Additional background reading and related work:

  1. The Cognitive Revolution Podcast - Episode on Interpretability. In this episode of The Cognitive Revolution, we delve into the science of understanding AI models' inner workings, recent breakthroughs, and the potential impact on AI safety and control.
  2. Auto-interp Paper: This paper applies automation to the problem of scaling an interpretability technique to all the neurons in a large language model.
  3. ARENA Interpretability with SAEs
  4. Gemma Scope: a comprehensive, open suite of sparse autoencoders for language model interpretability.
  5. Neuronpedia: a platform for accelerating research into sparse autoencoders.
  6. The Geometry of Concepts: Sparse Autoencoder Feature Structure paper. This paper investigates the structured organization of concept representations within large language models using sparse autoencoders, revealing a multi-scale structure with refined atomic parallelogram forms, modular brain-like spatial features, and anisotropic galaxy-scale distributions with unique eigenvalue properties.
  7. Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
  8. LessWrong search for SAEs
See the updated calendar and subscribe

Here is the schedule for the hackathon:
We start with an introductory talk and close the event the following week with an awards ceremony. Join the public iCal here. Before the hackathon begins, you will also find Explorer events, such as collaborative brainstorming and team match-making, on Discord and in the calendar.

📍 Registered jam sites

Besides remote and virtual participation, our amazing organizers also host local hackathon sites where you can meet up in person and connect with others in your area.

Reprogramming AI Models Hackathon: Edinburgh

AISIG - Decode AI's Black Box & Engineer Model Behavior Hackathon

Join us for the Hackathon in Hereplein 4, 9711GA, Groningen!

Reprogramming AI Models Hackathon

This is a collaboration between Warwick AI and Warwick Effective Altruism. We will be hosting groups that wish to participate in the hackathon for the weekend.

Reprogramming AI Models Hackathon: CAISH

Cambridge hub for hosting the Reprogramming AI hackathon. Office available with monitors and snacks!

Reprogramming AI Models Hackathon: EPFL hub

A local hub for the hackathon on the EPFL campus (luma coming soon). We will provide a room, snacks and drinks for the participants.

🏠 Register a location

The in-person events for the Apart Sprints are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research, student, and engineering community. Read more about organizing.



Use this template for your submission [Required]

Template in Overleaf/LaTeX

Submission Requirements

Each team should submit a research paper that includes:

  1. Project title and team members
  2. Executive summary (max 250 words)
  3. Introduction and problem statement
  4. Methodology and approach
  5. Results and analysis
  6. Discussion of implications for AI interpretability
  7. Conclusion and future work
  8. References

Additionally, teams should provide:

  • A link to their code repository (e.g., GitHub)
  • Any demo materials or visualizations (if applicable)

Evaluation Criteria

Submissions will be judged based on the following criteria:

  1. Interpretability Advancement
    • Does the project contribute to the field of AI interpretability?
    • Does it provide new insights into understanding or steering AI model behavior?
    • How well does it align with the hackathon's focus on reprogramming AI models?
    • How original and innovative is the approach?
  2. Relevance for AI Safety
    • How important is the contribution to advancing the field of AI safety?
    • Do we expect the results to generalize beyond the specific case(s) presented in the submission?
    • Does the approach introduce new safety mechanisms or enhance existing ones in innovative ways?
  3. Methodology & Presentation
    • How well is the project executed from a technical standpoint?
    • Is the code well-structured, documented, and reproducible?
    • How effectively does it utilize Goodfire's SDK/API and other provided resources?
    • How clearly and effectively is the research presented in the paper?
    • Quality of visualizations and demos (if applicable)
    • Clarity of methodology explanation and results interpretation

Key Points to Remember:

  • Focus on creating tools that provide meaningful insights into model behavior or enable precise modifications.
  • Consider how your approach might scale to larger, more complex AI systems.
  • Ethical considerations should be at the forefront of your design process.

💡 Frequently Asked Questions

General Questions

Q: What is the timeline for the hackathon?

A: The hackathon runs from Friday, 22 November, through Monday, 25 November (3 AM UTC). There will be two brainstorming sessions (Tuesday and Thursday), with API access being granted starting Thursday morning.

Q: What team size is recommended?

A: Teams should be 2-4 people, with a maximum of 5. While individual submissions are allowed, team participation is encouraged for better project outcomes.

Q: How do we get access to the API?

A:

  1. Form your team
  2. Have the team lead fill out the registration form
  3. Receive API access credentials (starting Thursday morning 21st November)

Technical Questions

SDK Specifics

Q: What exactly do we have access to in the SDK?

A: The SDK provides:

  • Feature inspection and intervention on Llama 3 8B and 70B
  • Single residual stream layer access
  • Feature activation analysis
  • Contrastive and semantic search
  • No raw neuron activations or decoder weights

Q: How do the different search methods work?

A:

  1. Semantic search: Uses cosine similarity between embeddings
  2. Contrastive search: Compares feature activations between datasets
  3. Reranking: Filters features based on relevance to specific queries
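
To make the contrastive and reranking steps concrete, here is a hedged sketch: two tiny conversation datasets (target vs. default behavior) are contrasted to surface candidate features, which are then reranked against a natural-language query. `client.features.contrast` and `client.features.rerank` are assumptions modeled on the advanced examples linked in the resources section.

```python
# Hedged sketch of contrastive search followed by reranking.
# Method names and signatures are assumptions; check the SDK docs.
import goodfire

client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")
variant = goodfire.Variant("meta-llama/Meta-Llama-3-8B-Instruct")

# Two small conversation datasets: target behavior vs. default behavior.
formal = [[
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Good day. How may I assist you?"},
]]
casual = [[
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "hey! what's up?"},
]]

# Contrastive search: features whose activations separate the two datasets.
casual_features, formal_features = client.features.contrast(
    dataset_1=casual,
    dataset_2=formal,
    model=variant,
    top_k=16,
)

# Reranking: filter the candidates by relevance to a specific query.
top_features = client.features.rerank(
    features=formal_features,
    query="formal, polite register",
    model=variant,
    top_k=4,
)
print(top_features)
```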

Q: What are the recommended weight ranges for feature intervention?

A:

  • Range: -1 to 1
  • Emphasis: Start with 0.5
  • Ablation: Start with -0.3
  • Multiple features can be modified simultaneously
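
As a concrete illustration of these starting points, here is a short sketch applying +0.5 (emphasis) to one feature and -0.3 (ablation) to another on the same variant; `variant.set` and `client.features.search` follow the quickstart pattern but should be treated as assumptions.

```python
# Sketch: applying the suggested starting weights to two features at once.
# Names and signatures are assumptions; see the SDK docs.
import goodfire

client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")
variant = goodfire.Variant("meta-llama/Meta-Llama-3-8B-Instruct")

humor = client.features.search("humor and jokes", model=variant, top_k=1)[0]
hedging = client.features.search("hedging language", model=variant, top_k=1)[0]

variant.set(humor, 0.5)     # emphasis: start around +0.5
variant.set(hedging, -0.3)  # ablation: start around -0.3
# Both interventions stay within the recommended -1 to 1 range and are
# active simultaneously on this variant.
```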

Feature Understanding

Q: How can we find relevant features for our task?

A: The SDK provides several methods:

  1. Search: Query features using string searches
  2. Inspect: Analyze feature activations for specific inputs
  3. Contrast: Compare features between different datasets

Q: Can we modify feature activations?

A: Yes, you can perform feature interventions to modify model behavior, including conditional interventions.
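
A hedged sketch of a conditional intervention, in the spirit of the "Conditional feature interventions" example linked in the resources: boost one feature only when another feature activates strongly. The `set_when` call and the comparison syntax are assumptions about the SDK surface; consult the linked advanced examples for the actual interface.

```python
# Sketch of a conditional intervention (names are assumptions; see the
# "Conditional feature interventions" docs linked above for the real API).
import goodfire

client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")
variant = goodfire.Variant("meta-llama/Meta-Llama-3-8B-Instruct")

rambling = client.features.search("long-winded explanations", model=variant, top_k=1)[0]
concise = client.features.search("concise answers", model=variant, top_k=1)[0]

# Only boost the "concise" feature when the "rambling" feature fires strongly.
variant.set_when(rambling > 0.6, {concise: 0.5})
```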

Q: Are the SAE features proprietary?

A: Yes, Goodfire's SAE features are proprietary but may be open-sourced in the future.

Q: How are features discovered and organized?

A:

  • Features reflect concept frequency in the LMSYS training data
  • More common concepts have better feature representation
  • Some moderation for safety concerns
  • Features have associated correctness scores

Q: What types of features are most reliable?

A:

  • Common properties and concepts
  • Frequently discussed topics
  • Specific nouns and properties
  • General behavioral patterns

Q: Can we modify the feature discovery process?

A: No, features are pre-discovered, but you can:

  • Use contrast analysis to find relevant features
  • Combine multiple features
  • Create feature groups
  • Experiment with feature weights

Technical Implementation

Q: How do we handle data preparation?

A:

  • Can use external libraries for PDF processing
  • OCR tools available for document processing
  • Text extraction supported
  • Dataset creation tools available

Q: Can we integrate with other frameworks?

A:

  • Yes, SDK works alongside other tools
  • OpenAI-compatible API interface
  • Can combine with external libraries
  • Support for custom evaluation frameworks
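
Because the API exposes an OpenAI-compatible interface (see the "Replace model calls with OpenAI-compatible API" quickstart linked in the resources), you can point an existing OpenAI client at it. A hedged sketch; the base URL below is a placeholder, so take the real endpoint from the docs:

```python
# Sketch: reusing the standard `openai` client against Goodfire's
# OpenAI-compatible endpoint. The base_url is a placeholder; copy the real
# endpoint from the quickstart linked in the resources section.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GOODFIRE_API_KEY",
    base_url="https://<goodfire-inference-endpoint>/v1",  # placeholder URL
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{
        "role": "user",
        "content": "Summarize mechanistic interpretability in one sentence.",
    }],
)
print(response.choices[0].message.content)
```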

Q: What are the computational requirements?

A:

  • $50 compute credits per team
  • Rate limits with exponential backoff
  • Can request increased limits with justification
  • Efficient API usage recommended
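
Since rate limits are enforced and exponential backoff is expected, a small generic retry helper (plain Python, independent of any particular SDK) can help teams stay within their limits; a minimal sketch:

```python
import random
import time


def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff plus jitter on failures.

    In practice, narrow the except clause to the SDK's rate-limit error.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))


# Hypothetical usage, assuming `client` and `variant` from the earlier sketches:
# reply = with_backoff(lambda: client.chat.completions.create(
#     model=variant, messages=[{"role": "user", "content": "ping"}]))
```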

Project Development

Q: How should we approach feature exploration?

A:

  1. Start with semantic search
  2. Use contrastive analysis for refinement
  3. Experiment with feature weights
  4. Build on example notebooks

Q: What are good starting points for projects?

A:

  • Chain of thought analysis
  • Feature visualization
  • Model behavior steering
  • Ethical reasoning investigation

Q: How can we evaluate our results?

A:

  • Use provided evaluation frameworks
  • Compare with baseline performance
  • Test across different contexts
  • Document feature interactions

Support & Resources

Q: Where can we get help during the hackathon?

A: You can:

  1. Use the help desk channel on Discord
  2. Contact the technical support team directly
  3. Attend office hours (Saturday/Sunday)
  4. Participate in hack talks

Q: What if we need increased rate limits?

A: Submit a request through the provided form with justification, and the team will review it case by case.

Q: Are there example notebooks available?

A: Yes, Goodfire provides several example notebooks, including the quickstart, the advanced feature-intervention examples, and the latent explorer notebook linked in the resources section above.

Q: How can we submit feature requests or feedback?

A: While immediate changes during the hackathon may not be possible, Goodfire welcomes all feedback and feature requests for future improvements.

Q: What if we encounter technical issues?

A: Post in the help desk channel on Discord or contact the technical support team directly (see the support channels listed above).

Q: How can we get help with project direction?

A: Bring your idea to the brainstorming sessions, office hours, or hack talks, and ask mentors on Discord for feedback on scope and direction.



Here are the entries for the hackathon:

AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control
Traditional fine-tuning methods for language models, while effective, often disrupt internal model features that could provide valuable insights into model behavior. We present a novel approach combining Reinforcement Learning (RL) with Activation Steering to modify model behavior while preserving interpretable features discovered through Sparse Autoencoders. Our method automates the typically manual process of activation steering by training an RL agent to manipulate labeled model features, enabling targeted behavior modification without altering model weights. We demonstrate our approach by reprogramming a language model to play Tic Tac Toe, achieving a 3X improvement in performance compared to the baseline model when playing against an optimal opponent. The method remains agnostic to both the underlying language model and RL algorithm, offering flexibility for diverse applications. Through visualization tools, we observe interpretable feature manipulation patterns, such as the suppression of features associated with illegal moves while promoting those linked to optimal strategies. Additionally, our approach presents an interesting theoretical complexity trade-off: while potentially increasing complexity for simple tasks, it may simplify action spaces in more complex domains. This work contributes to the growing field of model reprogramming by offering a transparent, automated method for behavioral modification that maintains model interpretability and stability.
Jeremias Lino Ferrao
December 3, 2024
Can we steer a model’s behavior with just one prompt? investigating SAE-driven auto-steering
This paper investigates whether Sparse Autoencoders (SAEs) can be leveraged to steer the behavior of models without using manual intervention. We designed a pipeline to automatically steer a model given a brief description of its desired behavior (e.g.: “Behave like a dog”). The pipeline is as follows: 1. We automatically retrieve behavior-relevant SAE features. 2. We choose an input prompt (e.g.: “What would you do if I gave you a bone?” or “How are you?”) over which we evaluate the model’s responses. 3. Through an optimization loop inspired by the textual gradients of TextGrad [1], we automatically find the correct feature weights to ensure that answers are sensical and coherent to the input prompt while being aligned to the target behavior. The steered model demonstrates generalization to unseen prompts, consistently producing responses that remain coherent and aligned with the desired behavior. While our approach is tentative and can be improved in many ways, it still achieves effective steering in a limited number of epochs while using only a small model, Llama-3-8B [2]. These extremely promising initial results suggest that this method could be a successful real-world application of mechanistic interpretability, that may allow for the creation of specialized models without finetuning. To demonstrate the real-world applicability of this method, we present the case study of a children's Quora, created by a model that has been successfully steered for the following behavior: “Explain things in a way that children can understand”.
Nicole Nobili, Davide Ghilardi, Wen Xing
November 24, 2024