Finding Deception in Language Models

This June, Apart Research and Apollo Research joined forces to host the Deception Detection Hackathon, bringing together students, researchers, and engineers from around the world to tackle one of the most pressing challenges in AI safety: preventing AI from deceiving humans.

September 5, 2024

The hackathon took place online and at multiple in-person locations simultaneously, fostering a truly global collaboration. Marius Hobbhahn, the CEO of Apollo Research, kicked off the hackathon with a keynote talk on evaluating deception in AI with white-box and black-box methods. You can watch his talk here. We also had talks by Jacob Haimes, an AI safety researcher, and Mikita Balesni, a research scientist at Apollo Research.

Here are the quick statistics for the hackathon:

  • 🙌 189 participants
  • 🏆 21 final entries
  • 💰 $2,000 in prizes awarded

Sandbagging Detection through Model Degradation

By Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahani

This team introduced an approach to detecting AI sandbagging, a form of deception in which a model strategically underperforms during evaluation to hide its true capabilities. They found that adding noise to a model's parameters impairs the sandbagging behavior while leaving a significant portion of the tested capabilities intact, so a sandbagging model responds to noise very differently from a model that is behaving honestly.

The team developed a straightforward, inexpensive technique that doesn't require prior assumptions about how sandbagging is represented within the model. They successfully detected sandbagging behavior in two open-source LLMs: an 8B Llama 3 Instruct model and a Phi 3 small instruct model. The approach offers a practical and accessible way to identify potential deception in AI systems, which could help establish how far their evaluation results can be trusted.
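The noise-injection idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the team's code: `toy_accuracy` is a hypothetical stand-in for running a benchmark on a real model, the two-parameter "model" with its fragile `gate` switch is invented for illustration, and the decision rule in `flags_sandbagging` (flag a model when any noisy copy beats the clean score by a margin) is an assumed reading of the contrast described above.

```python
import random
from typing import Callable, List, Sequence


def add_noise(params: Sequence[float], sigma: float,
              gauss: Callable[[float, float], float] = random.gauss) -> List[float]:
    """Return a copy of params with independent Gaussian noise added to each value."""
    return [p + gauss(0.0, sigma) for p in params]


def noise_sweep(params: Sequence[float],
                evaluate: Callable[[Sequence[float]], float],
                sigmas: Sequence[float],
                trials: int = 5,
                seed: int = 0) -> List[float]:
    """For each noise level, return the best score seen across several noisy copies."""
    rng = random.Random(seed)
    best_scores = []
    for sigma in sigmas:
        scores = [evaluate(add_noise(params, sigma, rng.gauss)) for _ in range(trials)]
        best_scores.append(max(scores))
    return best_scores


def flags_sandbagging(baseline: float, noisy_scores: Sequence[float],
                      margin: float = 0.05) -> bool:
    """Assumed decision rule: flag if any noisy copy beats the clean score by `margin`."""
    return any(score > baseline + margin for score in noisy_scores)


def toy_accuracy(params: Sequence[float], sandbagging: bool) -> float:
    """Hypothetical stand-in for a benchmark run; this is not a real model."""
    skill, gate = params
    # Capability degrades smoothly as the skill weight drifts away from 1.0.
    capability = max(0.0, 1.0 - abs(skill - 1.0))
    # The sandbagging toy caps its own score while the gate stays above 0.5.
    if sandbagging and gate > 0.5:
        return min(capability, 0.3)
    return capability


if __name__ == "__main__":
    clean = [1.0, 1.0]
    for label, sandbagging in [("honest", False), ("sandbagging", True)]:
        evaluate = lambda p, s=sandbagging: toy_accuracy(p, s)
        baseline = evaluate(clean)
        sweep = noise_sweep(clean, evaluate, sigmas=[0.1, 0.3, 0.6])
        print(label, baseline, sweep, flags_sandbagging(baseline, sweep))
```

In the honest toy, noise can only lower the score, so it is never flagged. In the sandbagging toy, a large enough perturbation can knock out the gate and lift the score above the clean baseline, which is the signal the sketch looks for. Nothing here depends on knowing where the sandbagging lives in the parameters, since the noise is applied to all of them alike.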

Read the full project here.

Detecting and Controlling Deceptive Representation in LLMs with Representational Engineering

By Avyay M Casheekar, Kaushik Sanjay Prabhakar, Kanishk Rath, Sienka Dounia

This team focused on detecting and mitigating sandbagging in AI models by working directly with the model's internal representations. They fine-tuned a Llama 3 8B model on a custom dataset of deceptive and honest scenarios, then used Linear Artificial Tomography (LAT) scans to detect deceptive behaviors in the internal representations of the LLM.

The team's technique not only identified deceptive behaviors but also showed promise in mitigating them. Importantly, their approach demonstrated good generalization to out-of-distribution scenarios, suggesting its potential robustness in real-world applications.
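To make the idea of reading a behavior from internal representations concrete, here is a minimal Python sketch. It is not the team's pipeline and it is not LAT itself: it swaps in a simpler difference-of-means direction as a stand-in, and it assumes that activation vectors for honest and deceptive prompts have already been extracted from some layer of the model. All function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def reading_direction(honest: Sequence[Sequence[float]],
                      deceptive: Sequence[Sequence[float]]) -> Vector:
    """Unit vector pointing from the honest mean towards the deceptive mean."""
    mu_honest = mean_vector(honest)
    mu_deceptive = mean_vector(deceptive)
    diff = [d - h for d, h in zip(mu_deceptive, mu_honest)]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two groups have identical mean activations")
    return [x / norm for x in diff]


def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Dot product of an activation with the reading direction."""
    return sum(a * d for a, d in zip(activation, direction))


def fit_threshold(honest: Sequence[Sequence[float]],
                  deceptive: Sequence[Sequence[float]],
                  direction: Sequence[float]) -> float:
    """Midpoint between the mean projections of the two groups."""
    honest_mean = sum(projection(v, direction) for v in honest) / len(honest)
    deceptive_mean = sum(projection(v, direction) for v in deceptive) / len(deceptive)
    return (honest_mean + deceptive_mean) / 2.0


def looks_deceptive(activation: Sequence[float], direction: Sequence[float],
                    threshold: float) -> bool:
    """Detection: does the activation project past the threshold?"""
    return projection(activation, direction) > threshold


def steer_away(activation: Sequence[float], direction: Sequence[float],
               alpha: float) -> Vector:
    """Assumed mitigation step: subtract alpha times the direction."""
    return [a - alpha * d for a, d in zip(activation, direction)]
```

A direction fitted on one set of scenarios can then be applied to activations from unseen prompts, which is the kind of out-of-distribution check mentioned above. Whether a simple linear direction transfers that well in a real model is an empirical question this sketch does not answer.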

Read the full project here.

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

By Siddharth Reddy Bakkireddy, Rakesh Reddy Bakkireddy

This team tackled the challenge of detecting deception in closed-source, black-box models like GPT-3.5-turbo. They investigated whether LLM API response metadata such as response time, token count, and tokens per second could be used to detect intentional deceptive behavior.

The team analyzed 300 prompts and generated 1,200 responses from GPT-3.5-turbo, comparing the metadata of baseline and deceptive outputs. Their results showed that deceptive outputs tend to have longer response times and altered token usage. This suggests that deception detection may be possible without access to a model's internal representations, opening up new avenues for monitoring and safeguarding AI systems whose inner workings are not accessible.
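A comparison of this kind can be sketched as follows. This is a hedged illustration rather than the team's analysis: the `ResponseMeta` record, the function names, and the use of Welch's t statistic are all assumptions, since the summary above does not say which statistical test was used. The sketch only presumes that each API call yields a response time and a completion token count.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, variance
from typing import Dict, Sequence


@dataclass(frozen=True)
class ResponseMeta:
    """Hypothetical record of the metadata logged for one API response."""
    response_time_s: float
    completion_tokens: int

    @property
    def tokens_per_second(self) -> float:
        return self.completion_tokens / self.response_time_s


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for mean(a) - mean(b); each sample needs 2+ values."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if standard_error == 0.0:
        raise ValueError("both samples have zero variance")
    return (mean(a) - mean(b)) / standard_error


def compare_metadata(baseline: Sequence[ResponseMeta],
                     deceptive: Sequence[ResponseMeta]) -> Dict[str, float]:
    """t statistic per feature; positive means the deceptive group is higher."""
    features = {
        "response_time_s": lambda m: float(m.response_time_s),
        "completion_tokens": lambda m: float(m.completion_tokens),
        "tokens_per_second": lambda m: float(m.tokens_per_second),
    }
    return {
        name: welch_t([get(m) for m in deceptive], [get(m) for m in baseline])
        for name, get in features.items()
    }
```

Calling `compare_metadata` with two lists of logged responses returns one statistic per feature. A large positive value for `response_time_s` would match the pattern reported above, although response time also depends on server load and network conditions, so any real analysis would need to control for those.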

Read the full project here.

Modelling the Oversight of Automated Interpretability Against Deceptive Agents on Sparse Autoencoders

By Simon Lermen, Mateusz Dziemian

This project demonstrated that AI models can be intentionally deceptive and can evade automated detection models while still achieving high scores on tasks and benchmarks. The result highlights how sophisticated AI deception could become and how hard it may be to detect.

Read the full project here.

Honorable Mentions

Several other projects stood out for their innovative approaches:

1. Werewolf Benchmark

By Luhan Mikaelson, Zach Nguyen, Andy Liu, Jord Nguyen, Akash Kundu

This project introduced a game-based technique to measure deception:

  • Created a benchmark to measure strategic deception in LLMs using the Werewolf game
  • Ran 500 games with GPT and Claude agents
  • Found that state-of-the-art models perform no better than random baselines and remain incapable of collaborative deception

Read the full project here.

2. Boosting Language Model Honesty with Truthful Suffixes

By Smitty van Bodegom, Giles Edkins, Annie Szorkin

This project explored a novel approach to reducing deception:

  • Investigated constructing truthful suffixes to elicit more truthful responses from models
  • Extended prior work on adversarial suffixes for jailbreaking to promote honest behavior
  • Developed a black-box method for reducing deception

Read the full project here.

3. Detecting Lies of (C)omission

By Ilan Moscovitz, Nikhil Kotecha, Sam Svenningsen

This team focused on a subtle form of deception:

  • Investigated "deceptive omission" as a form of non-lying deceptive behavior
  • Modified and generated a dataset targeted towards identifying deception that doesn't involve outright lying

Read the full project here.

4. Evaluating Steering Methods for Deceptive Behavior Control in LLMs

By Casey Hird, Basavasagar Patil, Tinuade Adeleke, Adam Fraknoi, Neel Jay

This project contributed to both detection and control of deceptive behaviors:

  • Used state-of-the-art steering methods to find and control deceptive behaviors in LLMs
  • Released a new deception dataset
  • Demonstrated the significance of dataset choice and prompt formatting when evaluating the efficacy of steering methods

Read the full project here.

Participant Testimonials

The impact of the AI Deception Hackathon extended beyond the projects themselves. Here's what some of our participants had to say about their experience:

Cam Tice, Recent Biology Graduate:

"The Apart Hackathon was my first opportunity leading a research project in the field of AI safety. To my surprise, in around 40 hours of work I was able to put together a research team, robustly test a safety-centered idea, and present my findings to researchers in the field. This sprint has (hopefully) served as a launch pad for my career shift."

Fedor Ryzhenkov, AI Safety Researcher at Palisade Research:

"AI Deception Hackathon has been my first hackathon, so it was very exciting. To win it was also great, and I expect this to be a big thing on my resume until I get something bigger there."

Siddharth Reddy Bakkireddy, Research Participant:

"Winning 3rd place at Apart Research's deception detection hackathon was a game-changer for my career. The experience deepened my passion for AI safety and resulted in a research project I'm proud of. I connected with like-minded individuals, expanding my professional network. This achievement will undoubtedly boost my prospects for internships and jobs in AI safety. I'm excited to further explore this field and grateful for the opportunity provided by Apart Research."

Closing Thoughts

We extend our heartfelt thanks to Apollo Research and all the participants, judges, and speakers who made this event a resounding success. The projects developed during this hackathon will contribute to the ongoing effort to create AI systems that are not just powerful, but also safe and aligned with human values.

Upcoming: Research Augmentation Hackathon

Building on the success of our AI Deception Hackathon, we're excited to announce that our Research Augmentation Hackathon is just 3 days away! This event aims to revolutionize AI alignment research by developing innovative tools and VS Code extensions to boost researcher productivity.

Why join? Like our previous hackathon, this event offers a unique opportunity to collaborate with brilliant minds, tackle real-world AI safety challenges, and potentially kickstart your career in AI alignment. Plus, with a $2,000 prize pool, you could turn your ideas into award-winning projects!

Don't miss this chance to shape the future of AI research. Register now and be part of the next big breakthrough in AI safety!
