Nov 25, 2024
Investigating Feature Effects on Manipulation Susceptibility
Nishchal Prabhakar, Stefan Trnjakov, Mo Aziz
In our project, we consider the effectiveness of the model's prompt injection protection, and in particular the features responsible for providing the bulk of this protection. We demonstrate that the features we identify are responsible for this protection by creating variants of the base model that perform significantly worse under prompt injection attacks.
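A minimal, hedged sketch of this kind of experiment: create a "variant" of the base model by clamping a single SAE feature during the forward pass, then measure how often prompt injection attacks succeed. All names here (the sae object, SAE_LAYER, FEATURE_IDX, INJECTION_PROMPTS, is_injected) are illustrative assumptions, not the authors' actual code or the Goodfire API.

```python
import torch


def clamp_feature_hook(sae, feature_idx, value):
    """Forward hook that rewrites one SAE feature of a layer's residual stream."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        feats = sae.encode(hidden)        # residual stream -> sparse feature space
        feats[..., feature_idx] = value   # clamp the target feature
        steered = sae.decode(feats)       # back to the residual stream
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook


@torch.no_grad()
def injection_success_rate(model, tokenizer, prompts, judge, max_new_tokens=64):
    """Fraction of attack prompts whose completion the judge flags as a successful injection."""
    hits = 0
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=max_new_tokens)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        hits += int(judge(text))
    return hits / len(prompts)


# Usage (placeholder objects): register the hook on the layer the SAE was trained on,
# run the attack suite, then remove the hook to restore the base model.
# layer = model.model.layers[SAE_LAYER]
# handle = layer.register_forward_hook(clamp_feature_hook(sae, FEATURE_IDX, value=0.0))
# variant_rate = injection_success_rate(model, tokenizer, INJECTION_PROMPTS, is_injected)
# handle.remove()
```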
Mateusz Dziemian
Great efficiency over a weekend! The study provides useful insights into security and information protection using SAEs and warrants further research to deepen understanding of this direction. For further study I would take inspiration from Anthropic's recent bias study and add standard benchmark performance metrics while the security features are varied.
Simon Lermen
This is a paper about identifying features that light up for prompt injections or jailbreaks. Potentially quite useful, as it might offer a practical method to harden models against such attacks by suppressing such features. Alternatively, it could help detect features that trigger when a prompt injection fails. It's interesting that steering with the Portugal feature leads to such a significant effect, though they haven't applied a proper control here. They should compare 1-2 control features to 1-2 target features. Possibly, the Portugal feature is mislabeled? They show that some Goodfire features are mislabeled, pointing to issues with LLM-written labels. Goodfire needs to use a lot of samples to write explanations, and validation is lacking. I could not find in the report which model they used for the SAE; what is the language model it was trained on?
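A short sketch of the control comparison proposed in this review: apply the same steering value to 1-2 target features and 1-2 unrelated control features, then compare the resulting attack success rates. Both steer_feature and attack_success_rate are hypothetical callables, assumed for illustration only.

```python
def compare_target_vs_control(attack_success_rate, steer_feature,
                              target_features, control_features, value=0.0):
    """Return {('target'|'control', feature_idx): success_rate} for each steered feature."""
    results = {}
    for group, features in (("target", target_features), ("control", control_features)):
        for idx in features:
            undo = steer_feature(idx, value)   # apply the intervention
            results[(group, idx)] = attack_success_rate()
            undo()                             # restore the unsteered model
    return results
```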
Alana Xiang
Interesting to see the Portuguese feature pop up again after reading https://www.apartresearch.com/project/assessing-language-model-cybersecurity-capabilities-with-feature-steering!
The password setup is an interesting environment to study jailbreaking, and the team finds interesting results.
Good work!
Cite this work
@misc{prabhakar2024manipulation,
  title={Investigating Feature Effects on Manipulation Susceptibility},
  author={Nishchal Prabhakar and Stefan Trnjakov and Mo Aziz},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}