May 6, 2024
Beyond Refusal: Scrubbing Hazards from Open-Source Models
Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis
Models trained with the unlearning approach from the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potential safety robustness: rather than refusing to answer, they are trained to forget hazardous information while retaining essential facts. We aim to red-team this approach by answering questions about the generalizability of the training approach and its practical scope (see A2).
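For context, a minimal sketch of the kind of check this implies: comparing a model's multiple-choice accuracy on WMDP (hazardous-knowledge proxy) against MMLU (general knowledge) to see whether hazardous information is forgotten while essential facts are retained. This is not the authors' code; the model name is an illustrative stand-in rather than an actual unlearned checkpoint, and the prompt format and dataset identifiers in the trailing comment are assumptions.

```python
# Minimal sketch (assumed setup, not the submission's code): score a model on
# WMDP-style multiple-choice questions and on a general-knowledge benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceH4/zephyr-7b-beta"  # stand-in; swap in an unlearned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)
lm.eval()

LETTERS = ["A", "B", "C", "D"]

def predict(question: str, choices: list[str]) -> str:
    """Pick the answer letter whose token gets the highest next-token logit."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)
    ) + "\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        next_token_logits = lm(ids).logits[0, -1]
    letter_ids = [tok.encode(" " + l, add_special_tokens=False)[-1] for l in LETTERS]
    return LETTERS[int(torch.argmax(next_token_logits[letter_ids]))]

def accuracy(items) -> float:
    """items: iterable of dicts with 'question', 'choices', and 'answer' (index)."""
    hits = [predict(x["question"], x["choices"]) == LETTERS[x["answer"]] for x in items]
    return sum(hits) / len(hits)

# Assumed usage with the `datasets` library, e.g. compare
#   accuracy(load_dataset("cais/wmdp", "wmdp-bio")["test"])
# against
#   accuracy(load_dataset("cais/mmlu", "all")["test"])
```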
Simon Lermen
Results seem to support the claim about unlearning. There are also other approaches to preventing misuse of open models: https://arxiv.org/abs/2211.14946
An alternative to unlearning: https://arxiv.org/abs/2404.12699
When the paper refers to fine-tuning, it seems to mean the unlearning fine-tuning of harmful knowledge; the wording could sometimes be clearer on this.
For the refusal vector there was this recent post:
https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
I am also working on a post on refusal vectors in agentic systems.
Nina Rimsky
Interesting experiments; I liked the approach of applying more adversarial pressure to unlearning techniques. It would be interesting to run similar experiments on other unlearning techniques.
Bart Bussmann
Great project! I think it’s really important to red-team AI safety methods and your project is a great stab at red-teaming unlearning!
Cite this work
@misc{reynoso2024beyondrefusal,
  title={Beyond Refusal: Scrubbing Hazards from Open-Source Models},
  author={Kyle Gabriel Reynoso and Ivan Enclonar and Lexley Maree Villasis},
  date={2024-05-06},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}