Jul 1, 2024
Deceptive behavior does not seem to be reducible to a single vector
Carl Vinas, Zmavli Caimle
Summary
Given the success of previous work showing that complex behaviors in LLMs, such as refusing to answer potentially harmful prompts, can be mediated by single directions or vectors, we investigated whether a similar vector could be created to induce intentionally incorrect answers. To create this vector, we took questions from TruthfulQA and generated model completions for each question with the model answering correctly and incorrectly on purpose, and used these paired responses to compute a steering vector. We then posed the same questions to Qwen-1_8B-chat with the vector applied and checked whether it intentionally gave incorrect answers. Across our experiments, we found that adding the vector did not yield any change in Qwen-1_8B-chat's answers. We discuss limitations of our approach and future directions for extending this work.
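The steering-vector construction described above can be sketched in a minimal, model-free form. This is an illustrative sketch, not the authors' actual pipeline: it assumes hidden-state activations have already been extracted for the "answer incorrectly" and "answer correctly" completions, and computes the standard contrastive (mean-difference) steering vector, which is then added to hidden states during generation. All function and variable names here are hypothetical.

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive steering vector: mean activation of the 'positive'
    (intentionally incorrect) completions minus the mean of the 'negative'
    (correct) completions."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden: np.ndarray, vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every token position's hidden state."""
    return hidden + alpha * vec

# Toy example with random stand-ins for extracted activations.
rng = np.random.default_rng(0)
d_model = 8
incorrect_acts = rng.normal(size=(4, d_model)) + 1.0  # from intentionally wrong answers
correct_acts = rng.normal(size=(4, d_model))          # from correct answers

vec = steering_vector(incorrect_acts, correct_acts)
hidden_states = rng.normal(size=(3, d_model))         # (tokens, d_model)
steered = apply_steering(hidden_states, vec, alpha=2.0)
```

In a real run, `incorrect_acts` and `correct_acts` would come from a chosen intermediate layer of Qwen-1_8B-chat (e.g. via forward hooks), and `apply_steering` would be applied at that same layer during generation; the scale `alpha` typically needs sweeping.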
Cite this work:
@misc{vinas2024deceptive,
  title={Deceptive behavior does not seem to be reducible to a single vector},
  author={Carl Vinas and Zmavli Caimle},
  date={2024-07-01},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}