This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Reprogramming AI Models Hackathon
November 25, 2024
Accepted at the Reprogramming AI Models Hackathon research sprint.
Feature Tuning versus Prompting for Ambiguous Questions

This study explores feature tuning as a method to improve the alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning. We compare this approach to the common method of controlling LLMs through prompting. Our experiments find that feature tuning and hidden prompts each improve answer quality on ambiguous questions to a similar degree, and that their combination yields the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make our code available in a GitHub repository.

By Elis Grahn, Axel Ahlqvist, Elliot Gestrin, Hemming Gong
