Nov 25, 2024
Tentative proposal for AI control with weak supervisors through Mechanistic Inspection
Gerard
The project proposes using weak but trusted AI models to supervise powerful, untrusted models by analyzing their internal states via Sparse Autoencoder (SAE) features. This approach aims to enhance oversight by detecting complex behaviors, such as deception, within the stronger models. Key challenges include managing the very large number of SAE features, ensuring dataset robustness, and avoiding reliance on untrusted systems for feature labeling.
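The report does not include an implementation, but the core loop it describes could look something like the following minimal sketch: encode the untrusted model's activations with an SAE, then hand the labels of the most active features to a weak trusted model for judgment. All names, dimensions, and the stub labels here are hypothetical illustrations, not the authors' code.

```python
# Hypothetical sketch of the proposed oversight loop; not from the report.
import torch

def sae_encode(activations: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    """Standard SAE encoder: sparse feature activations from model activations."""
    return torch.relu(activations @ W_enc + b_enc)

def top_feature_labels(features: torch.Tensor, labels: list[str], k: int = 10) -> list[str]:
    """Return human-readable labels for the k most active SAE features."""
    idx = torch.topk(features, k).indices.tolist()
    return [labels[i] for i in idx]

# Toy dimensions: d_model-sized activations, n_features-sized SAE dictionary.
d_model, n_features = 64, 512
torch.manual_seed(0)
acts = torch.randn(d_model)                    # stand-in for the untrusted model's activations
W_enc = torch.randn(d_model, n_features)       # stand-in for a trained SAE encoder
b_enc = torch.zeros(n_features)
feature_labels = [f"feature_{i}" for i in range(n_features)]  # would be auto-interp labels

features = sae_encode(acts, W_enc, b_enc)
report = top_feature_labels(features, feature_labels)
# A weak but trusted model would then be prompted with `report` and asked,
# e.g., "Do any of these features suggest deception?" before approving the output.
print(report)
```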
Liv Gorton
Interesting preliminary discussion on how SAEs might enhance the capability of a weaker but trusted LLM to supervise a more powerful, untrusted LLM. The author highlights many important challenges to succeeding in this work. They flagged that they ran out of time to implement their idea, so in the future I'd be really excited to see the results they could produce with this approach!
Mateusz Dziemian
The plan and study are heading in an interesting direction. To further help reduce the number of SAE features, maybe you could ask the trusted model to pick the words/tokens that it wants features for, and have it filter through features based on queries? Or something along those lines, where the trusted model is given more control over the task to reduce the search space.
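One way to read this suggestion is as a query-driven filter over auto-interp feature labels: the trusted model emits query terms, and only matching features are surfaced for inspection. A rough sketch under that assumption (the labels, terms, and function names are made up for illustration):

```python
# Hypothetical sketch of query-based feature filtering by the trusted model.
def filter_features_by_query(feature_labels: dict[int, str], query_terms: list[str]) -> list[int]:
    """Keep only features whose (auto-interp) label mentions a term the trusted model asked about."""
    return [
        idx for idx, label in feature_labels.items()
        if any(term.lower() in label.lower() for term in query_terms)
    ]

labels = {
    0: "mentions passwords or credentials",
    1: "python list comprehension",
    2: "evasive or deceptive phrasing",
}
# In practice these query terms would come from the trusted model's own output.
queries = ["deceptive", "credentials"]
print(filter_features_by_query(labels, queries))  # -> [0, 2]
```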
Simon Lermen
Issues: Very little implementation. I think they propose a simple idea, i.e., monitor which SAE features light up while the LLM is performing a task. However, I am not sure anyone has ever attempted to actually implement this. The main contribution of the report is Section 4, which goes through potential failure modes, for example unreliable labels or too many features. I think this paper could be quite interesting. For one, setting up a simple model of supervision was their stated goal, but they could not complete it. After building such a setup and finding a few cases where it works, they could then explore the different failure modes. This would need to be compared against some other supervision method.
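A minimal version of the monitor Simon describes might simply flag any trajectory in which features pre-labelled as suspicious activate above a threshold. The following sketch assumes that setup; the feature indices, threshold, and function are hypothetical, not from the report.

```python
# Hedged sketch of a threshold-based SAE feature monitor over one task trajectory.
import torch

def flag_trajectory(feature_acts: torch.Tensor, suspicious_ids: list[int], threshold: float = 0.5) -> bool:
    """feature_acts: (n_tokens, n_features) SAE activations; flag if any suspicious feature fires."""
    return bool((feature_acts[:, suspicious_ids] > threshold).any())

n_tokens, n_features = 20, 512
torch.manual_seed(1)
acts = torch.rand(n_tokens, n_features) * 0.4          # benign baseline activations
acts[7, 42] = 0.9                                      # one token strongly activates feature 42
print(flag_trajectory(acts, suspicious_ids=[42, 99]))  # True: trajectory escalated for review
# Its catch rate and false-positive rate could then be compared against a baseline
# such as prompting the trusted model on the transcript alone, as Simon suggests.
```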
Jaime Raldua
SAEs for AI Control sound like an amazing idea! As you point out, there seem to be major blockers along the way. It would also have been very useful to see some code and a clearer roadmap of what your next steps would look like if you had had more time (e.g., a couple of weeks) to continue this project.
Cite this work
@misc{
  title={Tentative proposal for AI control with weak supervisors through Mechanistic Inspection},
  author={Gerard},
  date={11/25/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}