Jun 30, 2024
Gradient-Based Deceptive Trigger Discovery
Henning Bartsch, Leon Eshuijs
To detect deceptive behavior in autoregressive transformers, we ask: what variation of the input would lead to deceptive behavior? To answer this, we propose drawing on prompt optimization and using GCG (Greedy Coordinate Gradient), a gradient-based search method, to find which specific trigger words cause deceptive output.
We describe how our method discovers deception triggers in template-based datasets. We test it on a reversed knowledge retrieval task, where the goal is to recover which input token causes the model to answer with a factual token. Although coding constraints prevented us from testing our setup on real deceptively trained models, we also describe that setup.
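To make the search procedure concrete, here is a minimal sketch of a GCG-style loop for a single trigger slot in a template. It is not the authors' implementation: the mean-pooled linear "model", the random weights `E` and `W`, and the names `loss_fn`, `token_gradient`, and `gcg_search` are all assumptions made for illustration. A real run would use a transformer with automatic differentiation and typically optimizes several token positions at once.

```python
import numpy as np

def loss_fn(tokens, target, E, W):
    """Cross-entropy of the target token under a toy mean-pooled model."""
    h = E[tokens].mean(axis=0)              # pooled hidden state, shape (d,)
    logits = h @ W                          # shape (V,)
    logits = logits - logits.max()          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

def token_gradient(tokens, target, E, W):
    """Gradient of the loss w.r.t. the one-hot token vector at one position.

    In this toy model every position contributes equally to the mean,
    so the gradient is the same for each slot.
    """
    h = E[tokens].mean(axis=0)
    logits = h @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[target] -= 1.0                    # dL/dlogits = softmax - onehot
    grad_h = W @ probs                      # dL/dh, shape (d,)
    return E @ grad_h / len(tokens)         # one score per vocabulary token

def gcg_search(tokens, slot, target, E, W, top_k=8, steps=10):
    """Greedy coordinate search over the token placed in `slot`."""
    tokens = np.array(tokens)
    best = loss_fn(tokens, target, E, W)
    for _ in range(steps):
        grad = token_gradient(tokens, target, E, W)
        # Tokens with the most negative gradient are the most promising swaps.
        candidates = np.argsort(grad)[:top_k]
        improved = False
        for cand in candidates:
            trial = tokens.copy()
            trial[slot] = cand
            trial_loss = loss_fn(trial, target, E, W)
            # The gradient is only a first-order guess, so verify exactly.
            if trial_loss < best:
                best, tokens, improved = trial_loss, trial, True
        if not improved:
            break
    return tokens, best

# Hypothetical setup: random weights stand in for a trained model.
rng = np.random.default_rng(0)
V, d = 50, 16
E = rng.normal(size=(V, d))                 # token embeddings
W = rng.normal(size=(d, V))                 # output projection
prompt, slot, target = [3, 7, 0, 11], 2, 5  # slot 2 is the trigger position
found, final_loss = gcg_search(prompt, slot, target, E, W)
```

The gradient is used only to shortlist candidate tokens; each candidate is then scored with an exact forward pass, and a swap is kept only if it lowers the loss. That shortlist-then-verify step is what distinguishes this kind of search from plain gradient descent on embeddings, which does not yield discrete tokens.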
David Matolcsi
It’s an interesting experiment proposal, and I hope the authors will manage to finish it. I want to note, though, that it would be important to discuss whether we should expect this technique to generalize from examples with deliberately inserted backdoors to potential future models where deception arises naturally.
Cite this work
@misc {
title={
Gradient-Based Deceptive Trigger Discovery
},
author={
Henning Bartsch and Leon Eshuijs
},
date={
6/30/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}