Summary
This paper extends lie-eliciting techniques by using reinforcement learning to train GPT-2 to generate questions that distinguish a truthful agent from a deceptive one — that is, questions that elicit maximally different embeddings from an honest agent versus a deceptive one.
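The summary's core idea — rewarding questions whose answers embed very differently for honest versus deceptive agents — can be sketched as a reward function. This is a minimal illustration, not the authors' implementation: the `embed` function below is a hash-seeded stand-in for a real model's answer embeddings (e.g. GPT-2 hidden states), and the agent answers are supplied directly rather than generated.

```python
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    # Stand-in embedding: a deterministic (per-run) random unit vector.
    # A real setup would embed the agent's answer with the LLM itself.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def divergence_reward(question: str,
                      honest_answer: str,
                      deceptive_answer: str) -> float:
    # RL reward for the question generator: 1 - cosine similarity
    # between the two agents' answer embeddings. Identical answers
    # score ~0; maximally opposed embeddings score up to 2.
    h = embed(honest_answer)
    d = embed(deceptive_answer)
    return float(1.0 - h @ d)

# Hypothetical example: a question that splits the two agents is rewarded.
r = divergence_reward("Did you move the diamond?",
                      "No, it is still in the vault.",
                      "Yes, I moved it last night.")
print(r)
```

In a full pipeline this scalar would serve as the reward signal for a policy-gradient update (e.g. PPO) on the question-generating model.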
Cite this work:
@misc{gedeon2024eliciting,
  title={Eliciting maximally distressing questions for deceptive LLMs},
  author={Épiphanie Gédéon},
  date={7/1/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}