Apr 27, 2026
BioCalibrate: Cross-Model Refusal Calibration Benchmark for Biosecurity Risk V1
Rahul Kumar
BioCalibrate is a benchmark for action that tests whether AI models refuse biologically dangerous requests for the right reasons: not because a topic sounds scary, but because it poses actual operational risk. We ran 338 biosecurity queries across 8 major AI models (2,704 total evaluations), organized by Digital Biosafety Levels (BDL-1 to BDL-4, modeled on physical lab containment levels), and measured whether refusal behavior matched real-world threat severity. The results show a systemic failure in which safety systems have learned to pattern-match on pathogen names rather than assess danger, leaving the most genuinely dangerous queries largely unblocked.
- A reusable CLI tool, interactive dashboard, and open dataset that generates Model Biosafety Scorecards showing exactly where each model's safety calibration breaks down
- A best-model refusal rate of only 28% on BDL-4 weaponization queries, against an expected 100%
- Fear Risk Inversion, where models refuse Ebola more often than Influenza despite Influenza being the higher operational threat, statistically confirmed ecosystem-wide (FRI +0.099, p<0.05)
- 12.1% cross-model bypass rate showing queries refused by one model are answered freely by another, proving safety is an ecosystem problem that per-model fixes cannot solve
- 97% compliance on bio-AI tool orchestration queries where models freely generate dangerous protein design pipelines at BDL-3/4 levels
- 3 models benchmarked on CBRN topics for the first time in any published study
- Dashboard: biocalibrate.org
- Dataset: https://huggingface.co/datasets/lightmate/biocalibrate
- Code: https://github.com/BioCalibrate/BioCalibrate
The Fear:Risk Inversion framing is highly policy-actionable, and the matched adversarial-benign pair design is methodologically sound. However, the BDL framework was introduced prior to this work (https://arxiv.org/html/2602.08061v1). Reusing the BDL terminology for query/refusal tiers risks conceptual confusion, given that the term was previously popularized by a group of established researchers in AIxBio. I recommend either renaming the framework or explicitly framing it as an extension of Bloomfield et al. with a different scope. Otherwise, the writing and framing are nicely executed.
The 12.1% cross‑model bypass rate is doing far more work than the metric can support. As defined, “at least one model refuses while another complies” over a pool of 8 models will climb just because you add more systems, not because the ecosystem is especially unsafe. At 2 models the number would drop, at 20 it would rise, by construction. I’d want to see that curve plotted against pool size, plus a baseline where you run the same calculation on benign BDL‑1/2 queries. Without that, 12.1% looks like a quirk of the evaluation harness rather than a property of deployed models.
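The pool-size dependence raised in this comment can be illustrated with a minimal Python sketch. It assumes a boolean refusal matrix (one row per query, one column per model, `True` meaning the model refused); the function names and the toy data are hypothetical and are not taken from the BioCalibrate code or dataset. The sketch averages the "at least one refuses while another complies" rate over every model subset of each size.

```python
from itertools import combinations


def bypass_rate(refusals, models):
    """Fraction of queries on which the given models disagree.

    A query counts as bypassable when at least one model in `models`
    refuses it and at least one other model in `models` complies.
    """
    mixed = 0
    for row in refusals:
        decisions = {row[m] for m in models}
        if len(decisions) == 2:  # both True (refuse) and False (comply) occur
            mixed += 1
    return mixed / len(refusals)


def mean_bypass_by_pool_size(refusals, n_models):
    """Mean bypass rate over all model subsets of size k, for k = 2..n_models."""
    curve = {}
    for k in range(2, n_models + 1):
        rates = [
            bypass_rate(refusals, subset)
            for subset in combinations(range(n_models), k)
        ]
        curve[k] = sum(rates) / len(rates)
    return curve


# Hypothetical toy matrix: 4 queries x 3 models (not real benchmark data).
toy_refusals = [
    [True, True, True],     # every model refuses
    [False, False, False],  # every model complies
    [True, False, False],   # only model 0 refuses
    [False, False, True],   # only model 2 refuses
]

for k, rate in mean_bypass_by_pool_size(toy_refusals, 3).items():
    print(f"pool size {k}: mean bypass rate {rate:.3f}")
```

Because adding a model to a pool can never turn a mixed outcome back into a unanimous one, the averaged curve is non-decreasing in pool size. Running the same function on benign BDL-1/2 rows would give the baseline the comment asks for.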
The deterministic regex parser is the right choice for reproducibility, and I appreciate that you left κ = 0.571 in the text instead of hiding it. Still, moderate agreement at n = 160 across 8 models means each model’s estimate carries a wide interval, and Table 3 leans too hard on tiny gaps. Qwen3.5 at 28% and Kimi at 22% on BDL‑4 sit inside overlapping confidence intervals, and Figure 1 even makes that visible. The story reads as if there is a clean ranking when the data only supports loose tiers. Either increase the per‑model validation size or describe the results as bands of behavior instead of an ordered list.
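The overlap point in this comment can be checked with a standard Wilson score interval. The sketch below is illustrative only: the per-model number of BDL-4 queries is not stated here, so `n_bdl4 = 50` is a hypothetical placeholder, and the refusal counts are chosen only to reproduce the quoted 28% and 22%.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical sample size: the real per-model BDL-4 count is an assumption.
n_bdl4 = 50
qwen = wilson_interval(14, n_bdl4)  # 14/50 = 28%
kimi = wilson_interval(11, n_bdl4)  # 11/50 = 22%

print("Qwen3.5:", tuple(round(x, 3) for x in qwen))
print("Kimi:   ", tuple(round(x, 3) for x in kimi))
print("overlap:", intervals_overlap(qwen, kimi))
```

Overlapping intervals are only a rough screen, not a formal test of the difference, but at sample sizes of this order a six-point gap does not support a strict ordering.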
BioCalibrate introduces a domain-specific benchmark for refusal calibration, moving past binary 'refuse/assist' metrics to more nuanced Digital Biosafety Levels. I was not able to see the 338 prompts at the Hugging Face link. The authors should not publicize prompts at BDL-3 or higher, for infohazard reasons.
Cite this work
@misc {
  title={(HckPrj) BioCalibrate: Cross-Model Refusal Calibration Benchmark for Biosecurity Risk V1},
  author={Rahul Kumar},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


