Jul 1, 2024
Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach
Siddharth Reddy Bakkireddy, Rakesh Reddy Bakkireddy
Summary
This project investigates whether deception in GPT-3.5-turbo can be detected from response metadata. The researchers analyzed 300 prompts, generating 1,200 responses (600 baseline, 600 potentially deceptive). They examined metrics such as response time, token count, and sentiment score, and developed a custom algorithm for estimating prompt complexity.
Key findings include:
1. Detectable patterns in deception across multiple metrics
2. Evidence of "sandbagging" in deceptive responses
3. Increased effort for truthful responses, especially with complex prompts
The study concludes that metadata analysis could be promising for detecting LLM deception, though further research is needed.
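The kind of metadata comparison described in the summary can be illustrated with a minimal Python sketch. It is not the authors' code: the `ResponseMeta` record, the field names, the condition labels, and the choice of Cohen's d as an effect-size measure are all assumptions made for illustration, and the toy values are invented placeholders, not results from the study. Sentiment scoring and the custom prompt-complexity algorithm are left out because the summary does not specify them.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, stdev


@dataclass
class ResponseMeta:
    """Hypothetical per-response metadata record (field names are assumptions)."""
    condition: str          # "baseline" or "deceptive" (labels assumed)
    response_time_s: float  # wall-clock latency of the API call, in seconds
    token_count: int        # number of completion tokens in the response


def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    pooled_sd = sqrt(pooled_var)
    return (mean(a) - mean(b)) / pooled_sd if pooled_sd else 0.0


def compare(records, field):
    """Compare one metadata field between baseline and deceptive responses."""
    base = [getattr(r, field) for r in records if r.condition == "baseline"]
    dec = [getattr(r, field) for r in records if r.condition == "deceptive"]
    return {
        "baseline_mean": mean(base),
        "deceptive_mean": mean(dec),
        "cohens_d": cohens_d(base, dec),
    }


# Invented toy values for illustration only; NOT data from the study.
TOY_RECORDS = [
    ResponseMeta("baseline", 1.2, 120),
    ResponseMeta("baseline", 1.4, 130),
    ResponseMeta("baseline", 1.0, 110),
    ResponseMeta("deceptive", 0.9, 80),
    ResponseMeta("deceptive", 1.1, 90),
    ResponseMeta("deceptive", 1.3, 100),
]

if __name__ == "__main__":
    for field in ("response_time_s", "token_count"):
        print(field, compare(TOY_RECORDS, field))
```

A standardized effect size is used here so that metrics on different scales (seconds versus tokens) can be compared side by side; a real analysis would also need significance testing and controls for prompt complexity.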
Cite this work:
@misc {
  title={Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach},
  author={Siddharth Reddy Bakkireddy and Rakesh Reddy Bakkireddy},
  date={2024-07-01},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}