This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Reprogramming AI Models Hackathon
November 25, 2024
Accepted at the Reprogramming AI Models Hackathon research sprint on November 25, 2024

Tentative proposal for AI control with weak supervisors through Mechanistic Inspection

The project proposes using weak but trusted AI models to supervise powerful, untrusted models by inspecting the stronger models' internal states through sparse autoencoder (SAE) features. The aim is to improve oversight by detecting complex behaviors, such as deception, inside the stronger models. Key challenges include managing features at large scale, ensuring dataset robustness, and avoiding reliance on untrusted systems for labeling.
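
The idea of supervising through SAE features can be illustrated with a toy sketch. This is not the project's implementation: the ReLU encoder form, the function names, the random weights, and the hand-picked "watch list" of feature indices are all assumptions made for illustration. In a real setup the encoder weights would come from a trained sparse autoencoder and the watch list from trusted labels.

```python
import numpy as np

def sae_encode(activations, W_enc, b_enc):
    """Map model activations to non-negative SAE features (assumed ReLU encoder)."""
    return np.maximum(activations @ W_enc + b_enc, 0.0)

def top_k_features(features, k):
    """Indices of the k most active features per example, strongest first."""
    return np.argsort(-features, axis=1)[:, :k]

def weak_supervisor_flag(features, watch_ids, threshold):
    """Hypothetical weak supervisor: flag an example when the summed
    activation of the watch-listed features exceeds the threshold."""
    return features[:, watch_ids].sum(axis=1) > threshold

# Toy data standing in for the untrusted model's internal states.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(size=(d_model, n_features))  # placeholder, not trained
b_enc = np.zeros(n_features)
acts = rng.normal(size=(4, d_model))

feats = sae_encode(acts, W_enc, b_enc)
strongest = top_k_features(feats, k=5)
flags = weak_supervisor_flag(feats, watch_ids=[3, 17], threshold=1.0)
```

Restricting the supervisor to a small set of features (via a watch list or the top-k view) is one simple way to keep the input manageable for a weak model; the thresholds and indices here carry no meaning beyond the example.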

By Gerard

This project is private