May 6, 2024

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis

Summary

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potential safety robustness because they are trained to forget hazardous information while retaining essential facts, rather than to refuse to answer. We red-team this approach by examining how well the training method generalizes and what its practical scope is (see A2).

Cite this work:

@misc{
  title={Beyond Refusal: Scrubbing Hazards from Open-Source Models},
  author={Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis},
  date={5/6/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Reviewer's Comments

No reviews are available yet

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.