May 6, 2024
Beyond Refusal: Scrubbing Hazards from Open-Source Models
Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis
Summary
Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark may be more robustly safe because they are trained to forget hazardous information while retaining essential facts, rather than to refuse to answer. We red-team this approach by asking how well the training approach generalizes and what its practical scope is (see A2).
Cite this work:
@misc{reynoso2024beyond,
title={Beyond Refusal: Scrubbing Hazards from Open-Source Models},
author={Kyle Gabriel Reynoso and Ivan Enclonar and Lexley Maree Villasis},
date={2024-05-06},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}