Jul 1, 2024
Detecting Deception with AI Tics 😉
Samuel Svenningsen, Ilan Moscovitz, Nikhil Kotecha
We present a novel approach: intentionally inducing subtle "tics" in AI responses as a marker for deceptive behavior. By adding a system prompt, we embed innocuous yet detectable patterns that manifest when the AI knowingly engages in deception.
Rudolf L
This work tests whether models follow a system prompt that tells them to mark any lie with a winky face. The robustness of the method is not discussed (though fine-tuning it in is suggested), nor is it compared to existing deception detection methods (e.g. contrast-consistent search and variants, activation vectors, blackbox lie detection). It is suggested that future models would find it hard to not include such tells if fine-tuned, but this statement is not supported. In particular, deception is only a problem if fine-tuning models for helpfulness & harmlessness does not work, so in any deception scenario, simple at least some simple fine-tuning strategies have already broken. Future work could explore whether fine-tuning such tells are more robust than other types of fine-tuning aimed at preventing or detecting deception.
Cite this work
@misc {
title={
@misc {
},
author={
Samuel Svenningsen, Ilan Moscovitz, Nikhil Kotecha
},
date={
7/1/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}