May 27, 2024
Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions
Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso
Summary
Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.
This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.
This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.
We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.
Cite this work:
@misc {
title={
Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions
},
author={
Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso
},
date={
5/27/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}