This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Deception Detection Hackathon: Preventing AI deception research sprint on July 1, 2024

From Sycophancy (not) to Sandbagging

We investigate whether sycophancy generalizes zero-shot to sandbagging. We develop an evaluation suite and test harness for Hugging Face models, consisting of a simple sycophancy evaluation dataset and a more advanced sandbagging evaluation dataset. For Llama3-8b-Instruct, we find no evidence that reinforcing sycophantic behavior in the first environment increases zero-shot sandbagging in the second environment.

Felix Hofstätter, Daniel Tan, Sohaib Imran, David Quarel
