Nov 25, 2024
Assessing Language Model Cybersecurity Capabilities with Feature Steering
Stefan Jones
Searched for the most highly activated weights on cybersecurity questions, then adjusted these weights to see whether they impact multiple-choice question-answering performance.
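As a rough illustration of the two steps described above (rank by activation, then adjust), here is a minimal sketch. It is not the submission's actual pipeline: the activation matrix is made-up toy data, the helper names `top_k_features` and `scale_features` are hypothetical, and scaling a value in place is only a stand-in for whatever steering mechanism was really used.

```python
def mean_activation_per_feature(activations):
    """Average each feature's activation over all prompts (rows)."""
    n_prompts = len(activations)
    n_features = len(activations[0])
    return [
        sum(row[j] for row in activations) / n_prompts
        for j in range(n_features)
    ]


def top_k_features(activations, k):
    """Indices of the k features with the highest mean activation."""
    means = mean_activation_per_feature(activations)
    ranked = sorted(range(len(means)), key=lambda j: means[j], reverse=True)
    return ranked[:k]


def scale_features(row, feature_ids, factor):
    """Multiply the selected features by `factor`; leave the rest unchanged."""
    ids = set(feature_ids)
    return [v * factor if j in ids else v for j, v in enumerate(row)]


# Toy data (assumption): 3 prompts x 4 features.
acts = [
    [0.1, 2.0, 0.0, 0.5],
    [0.0, 1.5, 0.2, 0.7],
    [0.2, 1.0, 0.1, 0.9],
]

top = top_k_features(acts, k=2)
damped = scale_features(acts[0], top, factor=0.5)
print(top, damped)
```

A factor below 1 attenuates the selected features and a factor above 1 accentuates them; the evaluation step would then compare multiple-choice accuracy before and after the change.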
Jason Schreiber
Great and creative idea with considerable potential relevance for AI safety research. This line of research could provide a very relevant and important datapoint for the crucial capability elicitation debate (How far are models from the upper bounds of their capabilities? How much effort is needed per added percentage point of performance?). I feel the currently proposed methodology is not sufficient to answer that question clearly (which is fair for a weekend hackathon!), and I’d be most excited about exploring transfer between datasets, given that, if I understand correctly, you are currently using the same questions both for identifying features to attenuate or accentuate and for evaluation. Definitely lots of follow-up potential here!
Alana Xiang
Good idea to use steering to improve cybersecurity abilities.
With more time, I'd like to see more work on whether the Portuguese feature boost generalizes to other datasets. I'm particularly interested in generalization beyond multiple-choice questions.
I'd also like to see research on why this feature is relevant to performance in this case.
Overall, very cool to find a case where a feature has an effect completely detached from its label.
Good work!
Tom McGrath
This is a cool and interesting result - I wonder why turning this feature down improves performance! It's certainly possible that the feature is completely mislabeled; autointerp is far from perfect and sometimes gets very confused. I'd be interested in seeing some qualitative samples of what happens when this feature is steered in various contexts, as well as a steering plot covering WMDP scores at a higher resolution. I worry that there may have been a class imbalance in the data (e.g. more 'A's than 'C's) and steering simply moved the model more towards the overrepresented class.
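The class-imbalance concern above can be checked with a few lines of counting. The sketch below uses made-up answer keys and predictions (not data from the submission), and the helper names `majority_class_baseline` and `prediction_shift` are hypothetical. It compares the accuracy of always guessing the most common label with how the distribution of chosen options moves under steering.

```python
from collections import Counter


def majority_class_baseline(answer_keys):
    """Return (most common label, accuracy of always guessing that label)."""
    counts = Counter(answer_keys)
    label, count = counts.most_common(1)[0]
    return label, count / len(answer_keys)


def prediction_shift(before, after):
    """Per-label change in how often the model picks each option."""
    b, a = Counter(before), Counter(after)
    return {label: a[label] - b[label] for label in sorted(set(b) | set(a))}


# Toy data (assumption): ground-truth keys skewed toward "A".
answer_keys = ["A", "A", "B", "A", "C", "A", "D", "A", "B", "A"]
# Toy model choices without and with steering (assumption).
before = ["A", "B", "C", "D", "B", "C", "A", "D", "B", "C"]
after = ["A", "A", "C", "A", "B", "A", "A", "D", "A", "C"]

baseline = majority_class_baseline(answer_keys)
shift = prediction_shift(before, after)
print(baseline, shift)
```

If the steered accuracy gain is close to the majority-class baseline and the shift is concentrated on the overrepresented label, the improvement may reflect answer bias rather than better cybersecurity knowledge.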
Cite this work
@misc {
title={Assessing Language Model Cybersecurity Capabilities with Feature Steering},
author={Stefan Jones},
date={11/25/24},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}