Nov 24, 2024
Steering Swiftly to Safety with Sparse Autoencoders
Agatha Duzan, Guillaume Martres, Syrine Noame, Abhinand Shibu, Flavia Wallenhorst, Arthur Wuhrmann
We explore using sparse autoencoders (SAEs) to unlearn dangerous capabilities in a cheaper and more interpretable way.
Very well-structured work and a very insightful comparison. It is a pity that the Goodfire API restrictions got in the way of reproducing the reference study. The proposed extension on testing robustness seems very promising and worth exploring.
I like the comparison between different approaches to using SAEs to unlearn dangerous knowledge. This project takes a very sensible approach (e.g. including MMLU as a standard performance benchmark) and is clearly presented. In future work, it could be interesting to explore the robustness of the unlearning and why performance on MMLU computer science appears to increase.
Cite this work
@misc{duzan2024steering,
  title={Steering Swiftly to Safety with Sparse Autoencoders},
  author={Agatha Duzan and Guillaume Martres and Syrine Noame and Abhinand Shibu and Flavia Wallenhorst and Arthur Wuhrmann},
  year={2024},
  month={November},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}


