Jul 1, 2024
Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach
Siddharth Reddy Bakkireddy, Rakesh Reddy Bakkireddy
This project investigates deception detection in GPT-3.5-turbo using response metadata. Researchers analyzed 300 prompts, generating 1200 responses (600 baseline, 600 potentially deceptive). They examined metrics like response times, token counts, and sentiment scores, developing a custom algorithm for prompt complexity.
Key findings include:
1. Detectable patterns in deception across multiple metrics
2. Evidence of "sandbagging" in deceptive responses
3. Increased effort for truthful responses, especially with complex prompts
The study concludes that metadata analysis could be promising for detecting LLM deception, though further research is needed.
Natalia
Really interesting approach to detecting deception. Surprising results - intuitively, I would have expected the opposite trend (more effort/slower response times for deceptive responses) - really liked that you linked this to sandbagging, although I would suggest further testing of this. Might be interesting to compare to human behavioural data as well. For completeness, I would have liked to see a comparison with other existing methods of detecting deception. Perhaps something to consider for future work? Excellent work overall.
Esben Kran
This project implements a new method to detect deception that doesn't rely on content or weight/bias analysis, which is novel and interesting! Having results on multiple models would have significantly strengthened the conclusions given that it must be different between models. Creating a general method that can work between models after a model calibration step would be very interesting. Great work otherwise, especially putting together your own dataset and presenting all the metadata. I do have some questions about the assumptions and before trusting the method, I'd need significantly more data. The compute efficiency of this compared to other methods if 10/10. Great work though!
David Matolcsi
I like the idea of testing metadata, and I think this might grow into.a useful (though clearly not adequate on its own) technique for deception detection. The experiments are well run and the results are clearly presented. I found the section on Evidence of Sandbagging weak though, I felt it’s premature to think of this as an example of the model deliberately downplaying its abilities, I don’t see why the usual strategic reasons to do that would apply here, and I expect there are more mundane explanations for the observed phenomenon.
Cite this work
@misc {
title={
@misc {
},
author={
Siddharth Reddy Bakkireddy, Rakesh Reddy Bakkireddy
},
date={
7/1/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}