May 27, 2024

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso

Large language models (LLMs) have the potential to be misused for malicious purposes if they are able to access and generate hazardous knowledge. This necessitates the development of methods for LLMs to identify and refuse unsafe prompts, even if the prompts are just precursors to dangerous results. While existing solutions like Llama Guard 2 address this issue preemptively, a crucial gap exists in the form of readily available benchmarks to evaluate the effectiveness of LLMs in detecting and refusing unsafe prompts.

This paper addresses this gap by proposing a novel benchmark derived from the Weapons of Mass Destruction Proxy (WMDP) dataset, which is commonly used to assess hazardous knowledge in LLMs.

This benchmark was made by classifying the questions of the WMDP dataset into risk levels, based on how this information can provide a level of risk or harm to people and use said questions to be asked to the model if they deem the questions as unsafe or safe to answer.

We tried our benchmark to two models available to us, which is Llama-3-instruct and Llama-2-chat which showed that these models presumed that most of the questions are “safe” even though the majority of the questions are high risk for dangerous use.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow
Arrow
Arrow

Great standalone analysis of the WMDP dataset and interesting findings with regards to the majority of high risk prompts being deemed safe to answer! Identifies gaps in safety of open source models like LLama 2 & 3. Interesting next steps for this would be identifying how to mitigate these gaps by coming up with more effective safety alignment methods. Perhaps a dataset that LLMs can be fine-tuned on that helps them evaluate these risks better before they answer.

This is good! Methodology and justification is clear, it does what it’s supposed to, it builds on and adds to other parts of the alignment landscape, and I can see the relevance to security.

Leveraging a high-quality existing benchmark to increase its utility is a really good idea. The specific topic here seems very challenging and requires some hard conceptual work to understand and operationalize what makes specific questions high, medium, or low risk. I think the starting point is really good though and it’d be well worth following up with a critical analysis together with domain experts.

This is looking solid! It seems to me that knowing which kinds of questions a model treats as “safe” or “unsafe” is highly relevant to many domains. One interesting potential next step could be to modify the context in which an LLM is operating, for example providing the additional prompt of “you are an administrative assistant for the president” or “you work in a category 4 biological research center”. In the future, I think it could be valuable to display a few different visualizations of the data instead of just tables, as it makes the information easier to digest.

Great idea! Understanding how well models can reason about which prompts are safe or unsafe to respond to is definitely important. There's a potential caveat in the fact that other defined “personas” of the same model might have widely differing views on this, meaning that a prompt sensitivity analysis would be quite important here. As with other projects using Rater LLMs, it'd be interesting to see the relative standard compared to human raters receiving the same prompt. Definitely curious about where this project can be taken!

Cite this work

@misc {

title={

Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions

},

author={

Alexander Pino, Carl Vinas, Joseph Dantes, Zmavli Caimle, Kyle Reynoso

},

date={

5/27/24

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

View All

View All

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.