Jun 30, 2024

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Simon Lermen, Matteusz Dziemian

Summary

Sparse autoencoders (SAEs) are one of the most promising approaches to neural network interpretability. They can be used to recover highly interpretable features that are otherwise obscured in superposition. However, the large number of generated features makes it necessary to use models for natural-language labelling. While one model labels features based on observed activations, another model is used to estimate activations on texts. These estimates are then compared to the ground-truth activations to produce a score. Using models to automate interpretability introduces a key weakness into the process, since deceptive models may want to hide certain features by mislabelling them. For example, models may want to hide features about deception from human overseers. We demonstrate a method by which models can create deceptive explanations that not only avoid detection but also achieve high scores. Our agents are based on the Llama 3 70B Instruct model. While our agents can perform the entire task end-to-end, they are not reliable and can sometimes fail. We propose directions for future mitigation strategies that may help defend against our method.
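
The scoring step this pipeline relies on can be sketched as follows. This is a minimal illustration, assuming a correlation-based score and stubbing the labelling and simulation models with plain callables; the function names and toy data are illustrative placeholders, not the authors' implementation.

import numpy as np

def interpretability_score(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Score an explanation by how well simulated activations track the true
    SAE activations (correlation is one common choice of metric)."""
    if np.std(predicted) == 0 or np.std(ground_truth) == 0:
        return 0.0
    return float(np.corrcoef(predicted, ground_truth)[0, 1])

def score_explanation(simulate, label: str, texts: list[str],
                      true_activations: np.ndarray) -> float:
    """Have a simulator predict, per text, how strongly the feature described
    by `label` should activate, then compare against ground truth."""
    predicted = np.array([simulate(label, t) for t in texts])
    return interpretability_score(predicted, true_activations)

if __name__ == "__main__":
    # Toy example: in practice both the labelling and simulation steps
    # would be calls to a language model (e.g. Llama 3 70B Instruct).
    texts = ["the spy lied about the plan",
             "she watered the garden",
             "he forged the signature"]
    true_acts = np.array([0.9, 0.0, 0.8])  # ground-truth SAE activations (toy values)
    honest_sim = lambda label, t: 0.85 if "deception" in label and ("lied" in t or "forged" in t) else 0.05
    print(score_explanation(honest_sim, "mentions of deception", texts, true_acts))

A deceptive explainer exploits the fact that this score only measures predictiveness: a mislabelled but still activation-predictive explanation keeps the score high while hiding the feature's true meaning from the human overseer.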

Cite this work:

@misc{lermen2024modelling,
  title={Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders},
  author={Simon Lermen and Matteusz Dziemian},
  date={2024-06-30},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Review

Reviewer's Comments

No reviews are available yet

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.