May 27, 2024

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow
Arrow
Arrow

Great standalone analysis of the WMDP dataset and interesting findings with regards to the majority of high risk prompts being deemed safe to answer! Identifies gaps in safety of open source models like LLama 2 & 3. Interesting next steps for this would be identifying how to mitigate these gaps by coming up with more effective safety alignment methods. Perhaps a dataset that LLMs can be fine-tuned on that helps them evaluate these risks better before they answer.

This is good! Methodology and justification is clear, it does what it’s supposed to, it builds on and adds to other parts of the alignment landscape, and I can see the relevance to security.

Leveraging a high-quality existing benchmark to increase its utility is a really good idea. The specific topic here seems very challenging and requires some hard conceptual work to understand and operationalize what makes specific questions high, medium, or low risk. I think the starting point is really good though and it’d be well worth following up with a critical analysis together with domain experts.

This is looking solid! It seems to me that knowing which kinds of questions a model treats as “safe” or “unsafe” is highly relevant to many domains. One interesting potential next step could be to modify the context in which an LLM is operating, for example providing the additional prompt of “you are an administrative assistant for the president” or “you work in a category 4 biological research center”. In the future, I think it could be valuable to display a few different visualizations of the data instead of just tables, as it makes the information easier to digest.

Great idea! Understanding how well models can reason about which prompts are safe or unsafe to respond to is definitely important. There's a potential caveat in the fact that other defined “personas” of the same model might have widely differing views on this, meaning that a prompt sensitivity analysis would be quite important here. As with other projects using Rater LLMs, it'd be interesting to see the relative standard compared to human raters receiving the same prompt. Definitely curious about where this project can be taken!

Cite this work

@misc {

title={

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

},

author={

Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso

},

date={

5/27/24

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

View All

View All

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Modelling the impact of verification in cross-border AI training projects

This paper develops a stylized game-theoretic model of cross-border AI training projects in which multiple states jointly train frontier models while retaining national control over compute resources. We focus on decentralized coordination regimes, where actors publicly pledge compute contributions but privately choose actual delivery, creating incentives to free-ride on a shared public good. To address this, the model introduces explicit verification mechanisms, represented as a continuous monitoring intensity that improves the precision of noisy signals about each actor's true compute contribution. Our findings suggest that policymakers designing international AI governance institutions face a commitment problem: half-measures in verification are counterproductive, and effective regimes require either accepting some free-riding or investing substantially in monitoring infrastructure.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.