Nov 24, 2024
Auto Prompt Injection
Yingjie Hu, Daniel Williams, Carmen Gavilanes, William Hesslefors Nairn
Summary
Prompt injection attacks exploit vulnerabilities in how large language models (LLMs) process inputs, enabling malicious behaviour or unauthorized information disclosure. This project investigates the potential for seemingly benign prompt injections to reliably prime models for undesirable behaviours, leveraging insights from the Goodfire API. Using our code, we generated two types of priming dialogues: one more aligned with the targeted behaviours and another less aligned. These dialogues were used to establish context before instructing the model to contradict its previous commands. Specifically, we tested whether priming increased the likelihood of the model revealing a password when prompted, compared to without priming. While our initial findings showed instability, limiting the strength of our conclusions, we believe that more carefully curated behaviour sets and optimised hyperparameter tuning could enable our code to be used to generate prompts that reliably affect model responses. Overall, this project highlights the challenges in reliably securing models against inputs, and that increased interpretability will lead to more sophisticated prompt injection.
Cite this work:
@misc {
title={
Auto Prompt Injection
},
author={
Yingjie Hu, Daniel Williams, Carmen Gavilanes, William Hesslefors Nairn
},
date={
11/24/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}