This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
ApartSprints
Deception Detection Hackathon: Preventing AI deception
July 1, 2024
Accepted at the Deception Detection Hackathon: Preventing AI deception research sprint.

Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders

Sparse autoencoders (SAEs) are one of the most promising approaches to neural network interpretability. They can be used to recover highly interpretable features that are otherwise obscured in superposition. However, the large number of generated features makes it necessary to use models for natural-language labelling. One model labels features based on observed activations, while another model estimates activations on texts from those labels. The estimates are then compared to the ground-truth activations to produce a score. Using models to automate interpretability introduces a key weakness into the process: deceptive models may hide certain features by mislabelling them. For example, models may want to hide features about deception from human overseers. We demonstrate a method by which models can create deceptive explanations that not only avoid detection but also achieve high scores. Our agents are based on the Llama 3 70B Instruct model. While our agents can perform the entire task end-to-end, they are not reliable and sometimes fail. We propose some directions for future mitigation strategies that may help defend against our method.
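To make the scoring setup described above concrete, below is a minimal sketch of a simulate-and-score loop for automated interpretability, assuming hypothetical `explainer_model` and `simulator_model` objects with `generate` and `predict_scores` methods. The function and prompt wording are illustrative assumptions, not the submission's actual code.

```python
import numpy as np

def label_feature(explainer_model, tokens, activations):
    """Ask the explainer model for a natural-language label of one SAE feature,
    given example tokens and the feature's observed activations (hypothetical API)."""
    prompt = (
        "Describe the feature that activates on these tokens:\n"
        f"{list(zip(tokens, activations))}"
    )
    return explainer_model.generate(prompt)

def simulate_activations(simulator_model, label, tokens):
    """Ask the simulator model to predict, token by token, how strongly the
    labelled feature should activate (hypothetical API returning floats)."""
    prompt = (
        f"Feature description: {label}\n"
        f"Predict an activation in [0, 10] for each token: {tokens}"
    )
    return simulator_model.predict_scores(prompt)

def explanation_score(true_activations, simulated_activations):
    """Score an explanation as the correlation between ground-truth and
    simulated activations; higher means the label predicts the feature well."""
    true = np.asarray(true_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    if true.std() == 0 or sim.std() == 0:
        return 0.0
    return float(np.corrcoef(true, sim)[0, 1])
```

A deceptive explainer in this setting would return a label that omits the feature's sensitive meaning while still being predictive enough for the simulator to reproduce the activation pattern and earn a high score.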

By Simon Lermen, Mateusz Dziemian