Nov 24, 2024
Feature based unlearning
Patrick Quinn, Yucheng Sun
An exploration of using features to perform unlearning on answering trivia questions.
Jaime Raldua
The results are pretty impressive given the time constraints and the API rate limits. I would love to see an extension of this work with smaller models and more features.
Tom McGrath
This is a good attempt at answering an important safety-relevant question. Unfortunately the current setup doesn't work well enough to accurately ablate factual knowledge, but it was worth trying and the methodology used here is sufficient to answer the question.
It's possible that using attribution would have improved feature selection and made it more automated. The results in Table 1 are impressive; however, I'd be interested in seeing more failure cases, since (as the authors indicate) Figure 1 tells a different story at the scale of the entire dataset.
Liv Gorton
A nice, practical approach to unlearning via sparse autoencoders! The features often being related to general question answering demonstrates one of the important challenges with scaling unlearning generally. I agree with the point about analysis of those manually discovered, more effective features being useful and it'd be cool to see if there's some sort of automated LLM workflow that would be able to surface those same features with less effort.
Cite this work
@misc {
title={Feature based unlearning},
author={Patrick Quinn, Yucheng Sun},
date={11/24/24},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}