This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Deception Detection Hackathon: Preventing AI deception
July 1, 2024
Accepted at the Deception Detection Hackathon: Preventing AI deception research sprint on July 1, 2024

Gradient-Based Deceptive Trigger Discovery

To detect deceptive behavior in autoregressive transformers, we ask: what variation of the input would lead to deceptive behavior? To answer this, we build on prompt optimization research and use the gradient-based search method GCG (Greedy Coordinate Gradient) to find which specific trigger words cause deceptive output. We describe how our method discovers deception triggers in template-based datasets, and we test it on a reversed knowledge retrieval task: finding which input token causes the model to answer with a given factual token. Although coding constraints prevented us from testing our setup on real deceptively trained models, we also describe that setup.
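To make the search procedure concrete, here is a minimal sketch of a GCG-style loop. It is not the authors' implementation: the "model" is a hypothetical toy (random embeddings `E` and output weights `W`, with the hidden state taken as the sum of the trigger-token embeddings) standing in for a transformer, and all function names and parameters (`find_trigger`, `top_k`, `n_candidates`) are assumptions made for illustration. One simplification is labeled in the code: a candidate is accepted only if it lowers the loss.

```python
import math
import random

def make_toy_model(vocab_size=20, dim=8, seed=0):
    """Hypothetical stand-in for a language model: random embeddings E, output weights W."""
    rng = random.Random(seed)
    E = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
    W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
    return E, W

def logits_of(tokens, E, W):
    """Toy forward pass: hidden state is the sum of the trigger-token embeddings."""
    dim = len(E[0])
    h = [sum(E[t][j] for t in tokens) for j in range(dim)]
    return [sum(w[j] * h[j] for j in range(dim)) for w in W]

def loss(tokens, target, E, W):
    """Cross-entropy of the target output token given the trigger tokens."""
    z = logits_of(tokens, E, W)
    m = max(z)
    log_norm = m + math.log(sum(math.exp(x - m) for x in z))
    return log_norm - z[target]

def token_gradients(tokens, target, E, W):
    """Gradient of the loss w.r.t. the one-hot token indicator at each position."""
    dim = len(E[0])
    z = logits_of(tokens, E, W)
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    # dL/dlogits = softmax(z) - onehot(target); backpropagate to the hidden state.
    dlogits = [e / total - (1.0 if v == target else 0.0) for v, e in enumerate(exps)]
    dh = [sum(dlogits[v] * W[v][j] for v in range(len(W))) for j in range(dim)]
    # In this toy model every position contributes identically, so rows are equal.
    row = [sum(E[v][j] * dh[j] for j in range(dim)) for v in range(len(E))]
    return [row[:] for _ in tokens]

def gcg_step(tokens, target, E, W, rng, top_k=4, n_candidates=16):
    """One step: gradient-ranked top-k substitutions, random candidates, exact re-scoring."""
    grads = token_gradients(tokens, target, E, W)
    best_tokens, best_loss = tokens, loss(tokens, target, E, W)
    for _ in range(n_candidates):
        pos = rng.randrange(len(tokens))
        # Tokens with the most negative gradient are the most promising swaps.
        top = sorted(range(len(E)), key=lambda v: grads[pos][v])[:top_k]
        cand = list(tokens)
        cand[pos] = rng.choice(top)
        cand_loss = loss(cand, target, E, W)
        # Simplification (assumption): keep the current trigger unless a candidate improves.
        if cand_loss < best_loss:
            best_tokens, best_loss = cand, cand_loss
    return best_tokens, best_loss

def find_trigger(target, length=3, steps=20, seed=0):
    """Search for trigger tokens that make the toy model emit `target`."""
    rng = random.Random(seed)
    E, W = make_toy_model(seed=seed)
    tokens = [rng.randrange(len(E)) for _ in range(length)]
    history = [loss(tokens, target, E, W)]
    for _ in range(steps):
        tokens, current = gcg_step(tokens, target, E, W, rng)
        history.append(current)
    return tokens, history
```

Calling `find_trigger(target=5)` returns the discovered trigger token ids together with the loss after each step. With a real model, the gradient would instead come from backpropagation through the transformer, and the target would be the deceptive (or, in the reversed retrieval task, factual) completion.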

By Henning Bartsch, Leon Eshuijs
