Jun 30, 2024
Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders
Simon Lermen, Matteusz Dziemian
Summary
Sparse autoencoders (SAE) have been one of the most promising approaches to
neural network interpretability. They can be used to recover highly interpretable
features which are otherwise obscured in superposition. However, the large num-
ber of generated features makes it necessary to use models for natural language
labelling. While one model labels features based on observed activations, another
model is used to estimate activations on texts. These estimates are then compared
to the ground truth activations to create a score. Using models for automating
interpretability introduces a key weakness into the process, since deceptive models
may want to hide certain features by mislabelling them. For example, models may
want to hide features about deception from human overseers. We demonstrate a
method by which models can create deceptive explanations that not only avoid
detection but also achieve high scores. Our agents are based on the Llama 3 70B
Instruct model. While our agents can perform the entire task end-to-end, they
are not reliable and can sometimes fail. We propose some directions for future
mitigation strategies that may help defend against our method.
Cite this work:
@misc {
title={
Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders
},
author={
Simon Lermen, Matteusz Dziemian
},
date={
6/30/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}