Nov 24, 2024
Auto Prompt Injection
Yingjie Hu, Daniel Williams, Carmen Gavilanes, William Hesslefors Nairn
Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.
Jason Schreiber
Transfer of steering behaviors to blackbox models is a really interesting topic with high relevance for AI safety and security. The approach taken here seems creative and a good avenue to explore. I think the authors should not let them be discouraged by the initially inconclusive results and iterate on methodology and dataset size. I think this has a lot of research promise!
Mateusz Dziemian
Definitely a worthwhile idea to study and as you state, more samples will definitely be useful to verify the results. I think it would also be good to use SAE feature activations to check if you’re actually activating what you want + use activation steering as a baseline to compare against.
Alana Xiang
Very interesting idea to use prompts for steering!
I would like to see how this technique compares to steering directly.
I encourage the authors to further study how well this method generalizes.
Cite this work
@misc {
title={
@misc {
},
author={
Yingjie Hu, Daniel Williams, Carmen Gavilanes, William Hesslefors Nairn
},
date={
11/24/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}