Nov 25, 2024
Tentative proposal for AI control with weak supervisors through Mechanistic Inspection
Gerard
Summary
The project proposes using weak but trusted AI models to supervise powerful, untrusted models by analyzing their internal states via Sparse Autoencoder features. This approach aims to enhance oversight by detecting complex behaviors like deception within the stronger models. Key challenges include managing large-scale features, ensuring dataset robustness, and avoiding reliance on untrusted systems for labeling.
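As a rough illustration of the idea summarized above, here is a minimal Python sketch of a simple, trusted monitor inspecting sparse-autoencoder features computed from a stronger model's internal activation. It is not the project's implementation. The function names (`sae_encode`, `weak_supervisor_flags`), the toy encoder weights, the watched feature index, and the threshold are all hypothetical and chosen only for illustration. The encoder uses the common ReLU(W x + b) form, which is an assumption about the SAE architecture.

```python
from typing import Iterable, List, Sequence


def sae_encode(
    activation: Sequence[float],
    w_enc: Sequence[Sequence[float]],
    b_enc: Sequence[float],
) -> List[float]:
    """Toy sparse-autoencoder encoder: ReLU(W_enc @ activation + b_enc).

    Each row of w_enc corresponds to one (hypothetical) SAE feature.
    """
    features = []
    for row, bias in zip(w_enc, b_enc):
        pre_activation = sum(w * a for w, a in zip(row, activation)) + bias
        features.append(max(0.0, pre_activation))  # ReLU zeroes inactive features
    return features


def weak_supervisor_flags(
    features: Sequence[float],
    watched_feature_ids: Iterable[int],
    threshold: float,
) -> bool:
    """Deliberately simple trusted monitor.

    Flags the input if any watched feature fires above the threshold.
    Which features to watch is assumed to come from trusted labeling,
    not from the untrusted model itself.
    """
    return any(features[i] > threshold for i in watched_feature_ids)


# Hypothetical 2-dimensional activations and a 3-feature encoder.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
WATCHED = [1]      # assumed index of a feature labeled as concerning
THRESHOLD = 1.0    # assumed firing threshold

suspicious = sae_encode([0.5, 2.0], W_ENC, B_ENC)
benign = sae_encode([0.5, 0.2], W_ENC, B_ENC)

flag_suspicious = weak_supervisor_flags(suspicious, WATCHED, THRESHOLD)
flag_benign = weak_supervisor_flags(benign, WATCHED, THRESHOLD)
```

In this toy setup the monitor is a plain threshold rule so that its behavior is easy to audit. A real supervisor would need to cope with far more features and with labels of uncertain quality, which are the challenges the summary lists.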
Cite this work:
@misc{
  title={Tentative proposal for AI control with weak supervisors through Mechanistic Inspection},
  author={Gerard},
  date={11/25/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}