Nov 24, 2024
Steering Swiftly to Safety with Sparse Autoencoders
Agatha Duzan, Guillaume Martres, Syrine Noame, Abhinand Shibu, Flavia Wallenhorst, Arthur Wuhrmann
Summary
We explore using sparse autoencoders (SAEs) to unlearn dangerous capabilities in a cheaper and more interpretable way.
Cite this work:
@misc{duzan2024steering,
title={Steering Swiftly to Safety with Sparse Autoencoders},
author={Agatha Duzan and Guillaume Martres and Syrine Noame and Abhinand Shibu and Flavia Wallenhorst and Arthur Wuhrmann},
date={2024-11-24},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}