Jun 30, 2024
From Sycophancy (not) to Sandbagging
Felix Hofstätter, Daniel Tan, Sohaib Imran, David Quarel
We investigate zero-shot generalization of sycophancy to sandbagging. We develop an evaluation suite and test harness for Hugging Face models, consisting of a simple sycophancy evaluation dataset and a more advanced sandbagging evaluation dataset. For Llama3-8b-Instruct, we do not find evidence that reinforcing sycophantic behavior in the first environment generalizes to an increase in zero-shot sandbagging in the second environment.
Cite this work
@misc{hofstatter2024sycophancy,
  title={From Sycophancy (not) to Sandbagging},
  author={Felix Hofstätter and Daniel Tan and Sohaib Imran and David Quarel},
  date={2024-06-30},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


