Nov 25, 2024
Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities
Hoang-Long Tran, Jack Kaunismaa, Edward Stevinson, Parv Mahajan, Oliver Clive-Griffin
Summary
We present a method that uses Sparse Autoencoder (SAE) feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). A logistic regression classifier trained on these activations accurately classifies diverse adversarial prompts and distinguishes successful attacks from unsuccessful ones. We used the Goodfire SDK with the LLaMA-8B model to examine latent feature activations and gain insight into adversarial interactions. The results highlight the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling the method and exploring its integration as a control mechanism for mitigating attacks.
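The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation and makes no Goodfire SDK calls: the synthetic vectors produced by the hypothetical helper `make_synthetic_activations` merely stand in for SAE feature activations, and the feature count, the "signal" feature indices, and the training hyperparameters are all assumptions chosen for illustration. The logistic regression is written with the standard library only so that the example is self-contained.

```python
import math
import random


def make_synthetic_activations(n_samples, n_features, signal_features, rng):
    """Hypothetical stand-in for SAE activations: sparse non-negative vectors.

    Label 1 ("adversarial") samples get extra activation on signal_features.
    """
    data = []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        x = [0.0] * n_features
        # Sparse background: a few random features fire weakly.
        for j in rng.sample(range(n_features), 3):
            x[j] = rng.uniform(0.0, 1.0)
        if label == 1:
            for j in signal_features:
                x[j] += rng.uniform(1.0, 2.0)
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logreg(data, n_features, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent on log loss."""
    w = [0.0] * n_features
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in data:
            err = score(w, b, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, data):
    correct = sum((score(w, b, x) >= 0.5) == (y == 1) for x, y in data)
    return correct / len(data)


rng = random.Random(0)
n_features = 16
signal = [2, 7]  # assumed "adversarial" feature indices, illustration only
train = make_synthetic_activations(200, n_features, signal, rng)
test = make_synthetic_activations(100, n_features, signal, rng)

w, b = train_logreg(train, n_features)
test_acc = accuracy(w, b, test)
# Features with the largest positive weights are candidates for inspection.
top = sorted(range(n_features), key=lambda j: w[j], reverse=True)[:2]
print(f"held-out accuracy: {test_acc:.2f}, top-weighted features: {top}")
```

A linear classifier is convenient in this setting because each learned weight maps to a single feature, so the largest weights point to the features worth inspecting when auditing a flagged prompt.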
Cite this work:
@misc{tran2024classification,
  title={Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities},
  author={Hoang-Long Tran and Jack Kaunismaa and Edward Stevinson and Parv Mahajan and Oliver Clive-Griffin},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}