Nov 24, 2024
Feature Tuning versus Prompting for Ambiguous Questions
Elis Grahn, Axel Ahlqvist, Elliot Gestrin, Hemming Gong
Summary
This study explores feature tuning as a method to improve the alignment of large language models (LLMs). We focus on addressing human psychological fallacies reinforced during the LLM training pipeline. Using sparse autoencoders (SAEs) and the Goodfire SDK, we identify and manipulate features in Llama-3.1-70B tied to nuanced reasoning, and compare this approach to the common method of controlling LLMs through prompting.
Our experiments find that feature tuning and hidden prompts each improve answer quality on ambiguous questions to a similar degree, and that their combination yields the best results. These findings highlight feature tuning as a promising and practical tool for AI alignment in the short term. Future work should evaluate this approach on larger datasets, compare it with fine-tuning, and explore its resistance to jailbreaking. We make our code available in a GitHub repository.
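The feature-manipulation step can be illustrated conceptually. The sketch below uses plain NumPy rather than the Goodfire SDK, and all names in it are illustrative: it steers a hidden activation by adding a scaled SAE decoder direction for a chosen feature, which is the basic idea behind this kind of feature tuning.

```python
import numpy as np

def steer_activation(hidden, decoder_direction, strength):
    """Add a scaled SAE feature direction to a hidden activation.

    hidden: model activation vector at some layer
    decoder_direction: the SAE decoder vector for the chosen feature
    strength: positive values amplify the feature, negative values suppress it
    """
    # Normalize so `strength` has a consistent scale across features.
    direction = decoder_direction / np.linalg.norm(decoder_direction)
    return hidden + strength * direction

# Toy example: a 4-dimensional activation and a feature aligned with dim 1.
h = np.array([0.5, -1.0, 0.25, 0.0])
d = np.array([0.0, 2.0, 0.0, 0.0])
steered = steer_activation(h, d, strength=3.0)
```

In practice the decoder direction would come from a trained SAE over the model's residual stream, and the steered activation would replace the original during the forward pass.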
Cite this work:
@misc{grahn2024featuretuning,
  title={Feature Tuning versus Prompting for Ambiguous Questions},
  author={Grahn, Elis and Ahlqvist, Axel and Gestrin, Elliot and Gong, Hemming},
  date={2024-11-24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart},
  howpublished={https://apartresearch.com}
}