Jul 1, 2024
Can Language Models Sandbag Manipulation?
Arthur Camara, Alexander Cockburn, Myles Heller
Summary
We expand on Felix Hofstätter's paper on the ability of LLMs to sandbag (intentionally perform worse) by exploring whether they can sandbag manipulation tasks. To test this, we use the "Make Me Pay" eval, in which agents try to manipulate each other into handing over money.
Cite this work:
@misc {
title={Can Language Models Sandbag Manipulation?},
author={Arthur Camara, Alexander Cockburn, Myles Heller},
date={7/1/24},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}