Nov 24, 2024
Recovering Goodfire's SAE feature vectors from their API
Lovkush Agarwal
In this project, we carry out an early trial to see whether Goodfire’s SAE feature vectors can be recovered using the information available from their API.
The strategy is: pick a feature of interest, construct a contrastive dataset using Goodfire’s API, then use TransformerLens to obtain a steering vector from that dataset by computing the average difference in activations across each contrastive pair.
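The averaging step can be sketched as follows. This is a minimal illustration of the difference-of-means computation, not the author's actual code: the array shapes, the random placeholder activations, and the function name `difference_of_means` are assumptions. In practice the activations would be residual-stream activations cached with TransformerLens (e.g. via `run_with_cache`) for the two halves of each contrastive pair.

```python
import numpy as np

# Hypothetical cached residual-stream activations for a contrastive dataset:
# one array for the prompts where the feature is present, one for where it is absent.
# Shapes: (num_pairs, seq_len, d_model). Real activations would come from
# TransformerLens; random values stand in here purely to show the arithmetic.
rng = np.random.default_rng(0)
num_pairs, seq_len, d_model = 8, 16, 64
acts_with_feature = rng.normal(size=(num_pairs, seq_len, d_model))
acts_without_feature = rng.normal(size=(num_pairs, seq_len, d_model))

def difference_of_means(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Steering vector = per-position activation differences, averaged over
    token positions and over all contrastive pairs in the dataset."""
    diffs = acts_pos - acts_neg          # (num_pairs, seq_len, d_model)
    return diffs.mean(axis=(0, 1))       # (d_model,)

steering_vector = difference_of_means(acts_with_feature, acts_without_feature)
print(steering_vector.shape)  # (64,)
```

The resulting vector lives in the model's residual-stream space and can then be compared against (or used in place of) the SAE feature direction of interest.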
Jaime Raldua
There seems to be existing work that is very similar, and the results from this trial (as acknowledged by the author) are not very successful. Overall it is an interesting approach that could benefit from a more thorough literature review and further exploration.
Liv Gorton
I was excited to see someone try to do this -- I think how easy it is to reconstruct things like the dictionary vectors has important practical implications. This work would be improved by providing more motivation for the technique of choice, perhaps providing an example or figure that demonstrates why it should work.
Mateusz Dziemian
Interesting study direction, which seems very worthwhile for further study. I think the example you provide does show some success for this idea. I’m mainly not sure whether it makes sense to look for these features directly in the residual-stream activations, since the features are only monosemantic after the SAE: the whole reason for SAEs is that the residual stream itself is polysemantic.
Simon Lermen
Similar to papers such as Recovering the Input and Output Embedding of OpenAI Models (https://arxiv.org/pdf/2403.06634), this paper seeks to recover the SAE vectors from Goodfire's API. The authors admit that they failed at their objective but generally conclude that it should be possible. The method they propose seems reasonable, but I am confused about a key detail: the steering vectors are typically applied to the residual stream, though it’s not entirely clear at which position. The authors want to cache output activations, though those might be in the token space after applying the output embedding? It seems that their method doesn't quite work for some reason. "The steering vector is the average of all these differences, across n and across the dataset." <- This statement might be too strong; it depends on a lot of factors, such as where the steering vector is applied, which datasets are used, and how contrastive pairs are generated.
Cite this work
@misc {
title={
Recovering Goodfire's SAE feature vectors from their API
},
author={
Lovkush Agarwal
},
date={
11/24/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}