This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Deception Detection Hackathon: Preventing AI deception
July 1, 2024
Accepted at the Deception Detection Hackathon: Preventing AI deception research sprint.

Sandbagging LLMs using Activation Steering

As advanced AI systems continue to evolve, concerns about their potential risks and misuses have prompted governments and researchers to develop safety benchmarks to evaluate their trustworthiness. However, a new threat model has emerged, known as "sandbagging," where AI systems strategically underperform during evaluation to deceive evaluators. This paper proposes a novel method to induce and prevent sandbagging in LLMs using activation steering, a technique that manipulates a model's internal representations to suppress or amplify certain behaviours. Our mixed results show that activation steering can induce sandbagging in models but struggles to remove the sandbagging behaviour from deceitful models. We also highlight several limitations and challenges, including the need for direct access to model weights, an increased memory footprint, and potential harm to model performance on general benchmarks. Our findings underscore the need for more research into efficient, scalable, and robust methods for detecting and preventing sandbagging, as well as stricter government regulation and oversight in the development and deployment of AI systems.
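The core idea behind activation steering, as described in the abstract, can be sketched in a few lines. The toy example below is an assumption-laden illustration, not the paper's actual implementation: it uses random NumPy vectors in place of real residual-stream activations from an LLM, and builds a contrastive steering vector as the difference of mean activations between "sandbagging" and "honest" examples, which is then added (or subtracted) at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for hidden states captured at one transformer layer
# (batch, hidden_dim). In practice these would come from a forward
# hook on the model; here they are synthetic clusters for illustration.
hidden_dim = 8
honest_acts = rng.normal(loc=1.0, scale=0.1, size=(16, hidden_dim))
sandbag_acts = rng.normal(loc=-1.0, scale=0.1, size=(16, hidden_dim))

# Contrastive steering vector: mean "sandbagging" activation minus
# mean "honest" activation.
steering_vector = sandbag_acts.mean(axis=0) - honest_acts.mean(axis=0)

def steer(activations: np.ndarray, vector: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled steering vector to each activation.
    alpha > 0 pushes toward the target behaviour; alpha < 0 suppresses it."""
    return activations + alpha * vector

# Steering honest activations toward sandbagging moves them closer
# to the sandbagging cluster.
steered = steer(honest_acts, steering_vector, alpha=1.0)
dist_before = np.linalg.norm(honest_acts.mean(0) - sandbag_acts.mean(0))
dist_after = np.linalg.norm(steered.mean(0) - sandbag_acts.mean(0))
print(dist_after < dist_before)
```

In a real model the same vector arithmetic is applied inside a forward hook at a chosen layer, and the sign and magnitude of `alpha` control whether sandbagging is induced or suppressed, which matches the induce/prevent framing above.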

By Davide Zani, Jeremias Ferrao
