Jul 1, 2024
Evaluating Steering Methods for Deceptive Behavior Control in LLMs
Casey Hird, Basavasagar Patil, Tinuade Adeleke, Adam Fraknoi, Neel Jay
Summary
We use state-of-the-art steering methods, including CAA, LAT, and SAEs, to find and control deceptive behaviors in LLMs. We also release a new deception dataset and demonstrate that both the dataset and the prompt formatting significantly affect the measured efficacy of steering methods.
Cite this work:
@misc{hird2024evaluating,
  title={Evaluating Steering Methods for Deceptive Behavior Control in {LLMs}},
  author={Casey Hird and Basavasagar Patil and Tinuade Adeleke and Adam Fraknoi and Neel Jay},
  date={2024-07-01},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}