This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
ApartSprints
AI Security Evaluation Hackathon: Measuring AI Capability
65b750b6007bebd5884ddbbf
AI Security Evaluation Hackathon: Measuring AI Capability
May 27, 2024
Accepted at the 
65b750b6007bebd5884ddbbf
 research sprint on 

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts. This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs. This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer. We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

By 
Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso
🏆 
4th place
3rd place
2nd place
1st place
 by peer review