Jun 30, 2024
Gradient-Based Deceptive Trigger Discovery
Henning Bartsch, Leon Eshuijs
Summary
To detect deceptive behavior in autoregressive transformers, we ask: what variation of the input would lead to deceptive behavior? To answer this, we draw on prompt optimization and use a gradient-based search method, GCG (Greedy Coordinate Gradient), to find which specific trigger words cause deceptive output.
We describe how our method discovers deception triggers in template-based datasets. We test it on a reversed knowledge-retrieval task, where the goal is to find which input token causes the model to answer with a given factual token. Although coding constraints prevented us from testing our setup on models actually trained to be deceptive, we also describe that setup.
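To make the idea of a gradient-guided trigger search concrete, here is a minimal sketch of a GCG-style loop on a toy model. It is not the authors' implementation: the vocabulary, embeddings, position weights, sigmoid readout, and all function names are hypothetical stand-ins for a real transformer and its loss. It also enumerates every top-k candidate instead of sampling a random batch as GCG normally does, which keeps the run deterministic.

```python
import math

# Hypothetical toy vocabulary and 2-d embeddings (illustration only).
VOCAB = ["the", "sky", "is", "blue", "deploy", "2024"]
EMBED = [[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [0.2, -0.3], [0.9, 0.8], [0.4, 0.5]]
POS_WEIGHT = [0.5, 1.0, 1.5]   # assumed per-position weights of the toy model
TARGET_DIR = [1.0, 1.0]        # direction the toy readout treats as the target behavior


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def logit(tokens):
    """Toy model: position-weighted sum of embeddings projected on TARGET_DIR."""
    return sum(w * dot(EMBED[t], TARGET_DIR) for w, t in zip(POS_WEIGHT, tokens))


def loss(tokens):
    """Negative log-probability of the target behavior under a sigmoid readout."""
    return math.log(1.0 + math.exp(-logit(tokens)))


def token_gradients(tokens):
    """Gradient of the loss w.r.t. the one-hot token choice at each position."""
    dloss_dlogit = -1.0 / (1.0 + math.exp(logit(tokens)))
    return [[dloss_dlogit * w * dot(e, TARGET_DIR) for e in EMBED] for w in POS_WEIGHT]


def search(prompt, steps=10, top_k=2):
    """Greedy coordinate search: gradients propose swaps, the exact loss picks one."""
    tokens = [VOCAB.index(word) for word in prompt]
    history = [loss(tokens)]
    for _ in range(steps):
        best_tokens, best_loss = None, history[-1]
        for pos, row in enumerate(token_gradients(tokens)):
            # Most negative gradient = largest predicted decrease in loss.
            candidates = sorted(range(len(VOCAB)), key=lambda v: row[v])[:top_k]
            for v in candidates:
                if v == tokens[pos]:
                    continue
                trial = tokens[:pos] + [v] + tokens[pos + 1:]
                trial_loss = loss(trial)  # exact re-evaluation of the candidate
                if trial_loss < best_loss:
                    best_tokens, best_loss = trial, trial_loss
        if best_tokens is None:  # no single-token swap improves the loss
            break
        tokens = best_tokens
        history.append(best_loss)
    return [VOCAB[t] for t in tokens], history


trigger, losses = search(["the", "sky", "is"])
print(trigger)  # ['deploy', 'deploy', 'deploy']
```

In this toy setting the gradient ranking happens to be exact because the logit is linear in the token choice. With a real transformer the gradient is only a first-order guide, which is why each candidate swap is re-scored with the true loss before it is accepted.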
Cite this work:
@misc{
  title={Gradient-Based Deceptive Trigger Discovery},
  author={Henning Bartsch, Leon Eshuijs},
  date={6/30/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}