This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
AI Security Evaluation Hackathon: Measuring AI Capability
May 27, 2024
Accepted at the AI Security Evaluation Hackathon: Measuring AI Capability research sprint

Deceptive behavior does not seem to be reducible to a single vector

Given the success of previous work in showing that complex behaviors in LLMs can be mediated by single directions or vectors, such as refusal to answer potentially harmful prompts, we investigated whether we could create a similar vector that steers the model to answer incorrectly. To create this vector, we used questions from TruthfulQA and generated model predictions for each question in the dataset by having the model intentionally answer correctly and incorrectly, and used these contrasting responses to generate the steering vector. We then prompted Qwen-1_8B-chat with the same questions while applying the vector, to see whether it would intentionally provide incorrect answers. Throughout our experiments, we found that adding the vector did not yield any change in Qwen-1_8B-chat's answers. We discuss the limitations of our approach and future directions for extending this work.

By Carl John Viñas, Zmavli Caimle