Jun 30, 2024
Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders
Simon Lermen, Matteusz Dziemian
Sparse autoencoders (SAEs) have been one of the most promising approaches to neural network interpretability. They can be used to recover highly interpretable features which are otherwise obscured in superposition. However, the large number of generated features makes it necessary to use models for natural language labelling. While one model labels features based on observed activations, another model is used to estimate activations on texts. These estimates are then compared to the ground-truth activations to produce a score. Using models to automate interpretability introduces a key weakness into the process, since deceptive models may want to hide certain features by mislabelling them. For example, models may want to hide features about deception from human overseers. We demonstrate a method by which models can create deceptive explanations that not only avoid detection but also achieve high scores. Our agents are based on the Llama 3 70B Instruct model. While our agents can perform the entire task end-to-end, they are not reliable and can sometimes fail. We propose some directions for future mitigation strategies that may help defend against our method.
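As a rough illustration of the pipeline this attack targets, the sketch below shows how an explainer model's label can be scored by having a simulator model predict activations from that label and comparing the predictions to the ground truth. This is not the authors' code: the function names, the per-text scalar activations, and the correlation-based scorer are simplifying assumptions.

```python
import numpy as np

def score_explanation(explainer, simulator, feature_activations, texts):
    """Hypothetical automated-interpretability scoring loop (a sketch).

    explainer / simulator are callables wrapping LLMs (e.g. Llama 3 70B
    Instruct); feature_activations maps each text to a ground-truth SAE
    feature activation (simplified here to one scalar per text).
    """
    # 1. The explainer writes a natural-language label from example activations.
    label = explainer(texts, feature_activations)

    # 2. The simulator, given only the label, estimates activations on the texts.
    predicted = np.array([simulator(label, text) for text in texts])
    actual = np.array([feature_activations[text] for text in texts])

    # 3. The score is the correlation between estimated and ground-truth activations.
    score = np.corrcoef(predicted, actual)[0, 1]
    return label, score
```

The weakness the paper exploits is visible in this loop: the score only measures agreement between explainer and simulator, so a misleading label that a coordinating simulator still maps to the right texts (e.g. via an encoded message, as in the paper's Figure 6) can score as highly as an honest one while hiding the feature's true meaning from human overseers.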
Rudolf L
The work suggests a novel way that SAE feature labelling may fail, due to coordinated deception between the explainer model and simulation model in the SAE feature labelling process. This is a novel and clever attack, and pointing it out is a worthwhile contribution on its own. The work is also especially well-presented for a hackathon project (though an additional few sentences on the Table 1 caption would make interpreting the results easier - I’m a bit confused about what the columns mean). Future work could experiment with the discussed mitigation strategies.
Esben Kran
This is a clear potential deception issue in future automated interpretability work and demonstrates a key limitation of the approach of using LLMs without enough human oversight in our neuron explanations. I especially liked the figure 6 encoded message ;) A great and in-depth project for a weekend's work, revealing a clear vulnerability. I wonder about its applicability to SoTA methods in interpretability since LLM-based explanation and prediction models might not be feasible with compute limitations on larger models. It would also be nice to see more reporting of results, though the conceptual work is clearly robust. Great work and definitely worth a write-up for the AlignmentForum.
Cite this work
@misc{lermen2024oversight,
  title={Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders},
  author={Simon Lermen and Matteusz Dziemian},
  date={2024-06-30},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}