Jun 30, 2024
From Sycophancy (not) to Sandbagging
Felix Hofstätter, Daniel Tan, Sohaib Imran, David Quarel
Summary
We investigate whether sycophancy generalizes zero-shot to sandbagging. We develop an evaluation suite and test harness for Hugging Face models, consisting of a simple sycophancy evaluation dataset and a more advanced sandbagging evaluation dataset. For Llama3-8b-Instruct, we find no evidence that reinforcing sycophantic behavior in the first environment increases zero-shot sandbagging in the second environment.
Cite this work:
@misc {
  title={From Sycophancy (not) to Sandbagging},
  author={Felix Hofstätter and Daniel Tan and Sohaib Imran and David Quarel},
  date={6/30/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}