Jul 1, 2024

Sandbag Detection through Model Degradation

Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahan

🏆 1st place by peer review

Details


Summary

We propose a novel technique for detecting sandbagging in LLMs: add varying amounts of noise to the model weights and monitor how performance changes.
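The noise-and-monitor idea in the summary can be sketched on a toy model. This is a minimal illustration and not the authors' implementation: the weights are a plain list of floats, the noise is assumed to be Gaussian, and the names `add_gaussian_noise`, `noise_sweep`, and `improves_under_noise` are hypothetical. The flagging rule (a noised score exceeding the noise-free score) is likewise an assumption made for illustration; the summary does not state the decision criterion.

```python
import random
from typing import Callable, List, Sequence, Tuple


def add_gaussian_noise(
    weights: Sequence[float], std: float, rng: random.Random
) -> List[float]:
    """Return a copy of `weights` with independent N(0, std^2) noise per entry."""
    return [w + rng.gauss(0.0, std) for w in weights]


def noise_sweep(
    weights: Sequence[float],
    evaluate: Callable[[Sequence[float]], float],
    stds: Sequence[float],
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Score the model at each noise level; returns (std, score) pairs."""
    rng = random.Random(seed)
    return [(std, evaluate(add_gaussian_noise(weights, std, rng))) for std in stds]


def improves_under_noise(
    results: Sequence[Tuple[float, float]], margin: float = 0.0
) -> bool:
    """Hypothetical flag (an assumption, not the authors' stated rule):
    True if any noised score beats the noise-free score by more than `margin`.
    Requires an entry with std == 0.0 to serve as the baseline."""
    baseline = next(score for std, score in results if std == 0.0)
    return any(score > baseline + margin for std, score in results if std > 0.0)
```

In practice `evaluate` would run a benchmark on the perturbed model, and `stds` would include `0.0` so that the unperturbed score is available as the baseline.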

Cite this work:

@misc{
  title={Sandbag Detection through Model Degradation},
  author={Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahan},
  date={7/1/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Review


Reviewer's Comments


No reviews are available yet

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.