Nov 25, 2024
Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities
Hoang-Long Tran, Jack Kaunismaa, Edward Stevinson, Parv Mahajan, Oliver Clive-Griffin
We present a method that leverages Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Using the Goodfire SDK with the LLaMA-8B model, we explore latent feature activations to gain insight into adversarial interactions. This approach highlights the potential of SAE activations to improve LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling the method and exploring its integration as a control mechanism for mitigating attacks.
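As a rough sketch of the pipeline described above (not the authors' exact implementation), the Python snippet below trains a logistic regression classifier on per-prompt SAE feature vectors. The feature dimension, the `get_sae_activations` helper, and the labeled prompts are hypothetical stand-ins; in the real pipeline the activations would be extracted via the Goodfire SDK from the LLaMA-8B model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

N_FEATURES = 512  # illustrative SAE feature dimension, not the paper's

def get_sae_activations(prompt: str) -> np.ndarray:
    # Stand-in for real feature extraction: in the actual pipeline this
    # would query the model (e.g., via the Goodfire SDK) and pool SAE
    # feature activations over the prompt's tokens. Deterministic random
    # vectors here only keep the sketch runnable end-to-end.
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(N_FEATURES)

# Illustrative labeled prompts (1 = adversarial, 0 = benign); random
# labels mean this demo scores near chance, unlike real SAE features.
rng = np.random.default_rng(0)
prompts = [f"prompt {i}" for i in range(200)]
labels = rng.integers(0, 2, size=len(prompts))

X = np.stack([get_sae_activations(p) for p in prompts])
y = np.asarray(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

The same setup extends to the paper's second task (separating successful from unsuccessful attacks) by swapping in those labels; held-out evaluation, as above, is what makes the reported test-set performance meaningful.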
Liv Gorton
This is addressing a really important practical problem. I liked their approach, am impressed with the results, and would be keen to see them build on this work!
Tom McGrath
This work covers an important problem and applies a sensible methodology. The results are impressive - I had to check the code to confirm they were in fact from a test set. I'd be interested in seeing how often harmless prompts are misclassified, though. Definitely worth extending further - these results are quite promising.
Cite this work
@misc{tran2024classification,
  title={Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities},
  author={Hoang-Long Tran and Jack Kaunismaa and Edward Stevinson and Parv Mahajan and Oliver Clive-Griffin},
  year={2024},
  month={November},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}