This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
AI and Democracy Hackathon: Demonstrating the Risks
May 6, 2024
Accepted at the AI and Democracy Hackathon: Demonstrating the Risks research sprint on May 6, 2024

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark appear to be more robustly safe because they are trained to forget hazardous information while retaining essential facts, rather than to refuse to answer. We aim to red-team this approach by examining how well the training generalizes and what its practical scope is (see A2).

By Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis