This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Reprogramming AI Models Hackathon research sprint on November 25, 2024
Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities

We present a method leveraging Sparse Autoencoder (SAE)-derived feature activations to identify and mitigate adversarial prompt hijacking in large language models (LLMs). By training a logistic regression classifier on SAE-derived features, we accurately classify diverse adversarial prompts and distinguish between successful and unsuccessful attacks. Utilizing the Goodfire SDK with the LLaMA-8B model, we explored latent feature activations to gain insights into adversarial interactions. This approach highlights the potential of SAE activations for improving LLM safety by enabling automated auditing based on model internals. Future work will focus on scaling this method and exploring its integration as a control mechanism for mitigating attacks.
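
As a rough illustration of the pipeline described above, the sketch below shows how SAE-derived feature activations could feed a logistic regression classifier. This is a minimal sketch, not the team's actual code: `get_sae_activations` is a hypothetical placeholder for the Goodfire SDK feature-extraction step on the LLaMA-8B model, and the use of scikit-learn's `LogisticRegression` is an assumption consistent with, but not confirmed by, the abstract.

```python
# Minimal sketch of the classification pipeline described above.
# `get_sae_activations` is a hypothetical stand-in for reading SAE feature
# activations via the Goodfire SDK; the actual SDK calls are not shown here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


def get_sae_activations(prompt: str) -> np.ndarray:
    """Hypothetical helper: run `prompt` through the LLaMA-8B model via the
    Goodfire SDK and return an SAE feature activation vector (e.g. pooled
    over tokens). Replace with the real SDK calls."""
    raise NotImplementedError


def train_prompt_classifier(prompts: list[str], labels: list[int]) -> LogisticRegression:
    """Fit a logistic regression classifier on SAE-derived features.

    Depending on the experiment, `labels` might encode adversarial vs. benign
    prompts, or successful vs. unsuccessful attacks.
    """
    # Build the feature matrix from SAE activations, one row per prompt.
    X = np.stack([get_sae_activations(p) for p in prompts])
    y = np.asarray(labels)

    # Hold out a stratified test split to estimate classification accuracy.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # Report precision/recall on the held-out prompts.
    print(classification_report(y_test, clf.predict(X_test)))
    return clf
```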

By Hoang-Long Tran, Jack Kaunismaa, Edward Stevinson, Parv Mahajan, Oliver Clive-Griffin

This project is private