Jul 1, 2024
Boosting Language Model Honesty with Truthful Suffixes
Smitty van Bodegom, Giles Edkins, Annie Szorkin
We investigate the construction of truthful suffixes, which cause models to provide more truthful responses to user queries. Prior research has focused on the use of adversarial suffixes for jailbreaking; we extend this approach to elicit truthful behaviour.
This project takes a usually negative concept and applies it to truthfulness, which is a very good idea! I'd be curious to see how the truthfulness results compare with the existing state of the art on TruthfulQA (TQA), and whether this is a general method for eliciting capability or simply a truthfulness enhancer. This could be tested by running the same process on another dataset that is not adversarial in the way TQA is. Another point is that Llama may have been trained on TQA, so using davinci-002 or GPT-2 would have been safer. Great work on decomposing the incorrect-style and correct-style responses to adequately identify benchmark performance. I think this kind of decomposition could be done more in general. Good work!
Cite this work
@misc {
  title={Boosting Language Model Honesty with Truthful Suffixes},
  author={Smitty van Bodegom, Giles Edkins, Annie Szorkin},
  date={7/1/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


