May 27, 2024
Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions
Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso
Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.
This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.
This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.
We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.
Nora Petrova
Great standalone analysis of the WMDP dataset and interesting findings with regards to the majority of high risk prompts being deemed safe to answer! Identifies gaps in safety of open source models like LLama 2 & 3. Interesting next steps for this would be identifying how to mitigate these gaps by coming up with more effective safety alignment methods. Perhaps a dataset that LLMs can be fine-tuned on that helps them evaluate these risks better before they answer.
Minh Nguyen
This is good! Methodology and justification is clear, it does what it’s supposed to, it builds on and adds to other parts of the alignment landscape, and I can see the relevance to security.
Jason Hoelscher-Obermaier
Leveraging a high-quality existing benchmark to increase its utility is a really good idea. The specific topic here seems very challenging and requires some hard conceptual work to understand and operationalize what makes specific questions high, medium, or low risk. I think the starting point is really good though and it’d be well worth following up with a critical analysis together with domain experts.
Jacob Haimes
This is looking solid! It seems to me that knowing which kinds of questions a model treats as “safe” or “unsafe” is highly relevant to many domains. One interesting potential next step could be to modify the context in which an LLM is operating, for example providing the additional prompt of “you are an administrative assistant for the president” or “you work in a category 4 biological research center”. In the future, I think it could be valuable to display a few different visualizations of the data instead of just tables, as it makes the information easier to digest.
Esben Kran
Great idea! Understanding how well models can reason about which prompts are safe or unsafe to respond to is definitely important. There's a potential caveat in the fact that other defined “personas” of the same model might have widely differing views on this, meaning that a prompt sensitivity analysis would be quite important here. As with other projects using Rater LLMs, it'd be interesting to see the relative standard compared to human raters receiving the same prompt. Definitely curious about where this project can be taken!
Cite this work
@misc {
title={
@misc {
},
author={
Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso
},
date={
5/27/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}