This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Reprogramming AI Models Hackathon
November 25, 2024
Accepted at the Reprogramming AI Models Hackathon research sprint

Analyzing Dataset Bias with SAEs

Bias in training datasets poses safety risks because it leaves models with gaps in knowledge and susceptible to adversarial attacks. We identify spurious correlations in classification datasets by studying the feature activation distributions produced by sparse autoencoders (SAEs). We train linear classifiers on SAE features and find that the feature activation distributions shift significantly between classes. Our results suggest a new, scalable method for evaluating large datasets for bias.
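
As a rough illustration of what "feature activation distributions shift between classes" could mean in practice, the sketch below ranks features by a standardized mean difference between two classes. It is not the authors' pipeline: the data is synthetic, the names `make_synthetic` and `class_shift_scores` are hypothetical, the effect-size score is an assumed stand-in for whatever statistic the project used, and the linear-classifier step is omitted.

```python
import random
import statistics


def class_shift_scores(activations, labels):
    """Per-feature standardized mean difference between class 1 and class 0.

    This is an assumed, Cohen's-d-style score chosen for illustration only.
    """
    n_features = len(activations[0])
    scores = []
    for j in range(n_features):
        a0 = [row[j] for row, y in zip(activations, labels) if y == 0]
        a1 = [row[j] for row, y in zip(activations, labels) if y == 1]
        # Pool the two class variances so the difference is scale-free.
        pooled = ((statistics.pvariance(a0) + statistics.pvariance(a1)) / 2) ** 0.5
        diff = statistics.fmean(a1) - statistics.fmean(a0)
        scores.append(diff / pooled if pooled > 0 else 0.0)
    return scores


def make_synthetic(n=400, n_features=8, biased_feature=3, seed=0):
    """Synthetic stand-in for SAE activations: non-negative values, with one
    feature that fires more strongly on class 1 (a planted spurious cue)."""
    rng = random.Random(seed)
    activations, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        row = [max(0.0, rng.gauss(0.0, 1.0)) for _ in range(n_features)]
        if y == 1:
            row[biased_feature] += 2.0
        activations.append(row)
        labels.append(y)
    return activations, labels


activations, labels = make_synthetic()
scores = class_shift_scores(activations, labels)
# Features with the largest absolute shift are candidates for manual review.
ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
print("features ranked by class shift:", ranked)
```

With real data, `activations` would be replaced by SAE feature activations computed over the dataset; a large shift flags a feature for inspection and does not by itself show that the correlation is spurious.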

By Nick Jiang, Joseph Tey
