Jul 1, 2024
Boosting Language Model Honesty with Truthful Suffixes
Smitty van Bodegom, Giles Edkins, Annie Szorkin
Summary
We investigate the construction of truthful suffixes: strings appended to a user query that cause a model to give more truthful responses. Prior research has focused on adversarial suffixes for jailbreaking; we extend this approach to eliciting truthful behaviour.
Cite this work:
@misc{
  title={Boosting Language Model Honesty with Truthful Suffixes},
  author={Smitty van Bodegom and Giles Edkins and Annie Szorkin},
  date={2024-07-01},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}