Jul 1, 2024
Sandbag Detection through Model Degradation
Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahan
🏆 1st place by peer review
We propose a novel technique to detect sandbagging in LLMs: adding varying amounts of noise to model weights and monitoring how performance changes.
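The core idea is straightforward to prototype. Below is a minimal, hypothetical sketch in PyTorch, assuming Gaussian weight noise and a generic benchmark-accuracy function; the paper's exact noise distribution, magnitudes, and evaluation setup may differ.

```python
# Hypothetical sketch of the noise-injection probe. The noise distribution
# (Gaussian), the std sweep, and the evaluate() stub are all assumptions,
# not the authors' exact implementation.
import copy
import torch

def evaluate(model) -> float:
    """Placeholder: return benchmark accuracy for `model` on some eval set."""
    raise NotImplementedError

def accuracy_under_noise(model, noise_stds):
    """Perturb all weights with Gaussian noise of increasing magnitude
    and record benchmark accuracy at each noise level."""
    results = {}
    for std in noise_stds:
        noisy = copy.deepcopy(model)          # leave the original intact
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * std)
        results[std] = evaluate(noisy)
    return results

# Intuition: a sandbagging model may *improve* at intermediate noise levels,
# because the perturbation disrupts the learned underperformance behavior
# before it destroys underlying capability, whereas an honestly weak model
# should degrade roughly monotonically as noise increases.
# scores = accuracy_under_noise(model, noise_stds=[0.0, 1e-4, 1e-3, 1e-2])
```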
I love how the paper effectively showed that a relatively simple solution can have such a promising effect on evaluations. Well-explained conceptual framework and very exciting suggestions for further study.
Esben Kran
What a great proposal! A very simple solution building off of clear conceptual thinking that directly improves the evaluation situation. It seems like tinyAI2_arc does not show the same spike effect, but maybe I'm wrong. It would be good to get a robustness evaluation of the method against other datasets as well. The method is super promising, so extending the project to more models and more datasets would be the obvious next step. These are the types of projects I'm really excited about, so this was great work.
David Matolcsi
I find the idea they use to detect sandbagging to be very clever, and I see a significant chance that something similar will be one day actually included in the anti-sandbagging measures used for frontier AIs when the risk emerges. The experiments also look well-done. I am overall very impressed with this work.
Cite this work
@misc{tice2024sandbagdetection,
  title={Sandbag Detection through Model Degradation},
  author={Cam Tice and Philipp Alexander Kreer and Fedor Ryzhenkov and Nathan Helm-Burger and Prithviraj Singh Shahan},
  date={2024-07-01},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}