Apr 27, 2026
Biorisk-gym: A new direction for “rule-in” AI model evaluations for biosecurity
Aditya Kasarla, Aanika Dalaal, Kailer Laino
Accurately measuring the biosecurity risks of rapidly advancing AI models is a critical security challenge. Current benchmarks rely on static, multiple-choice tests that act as "rule-out" evaluations and fail to capture realistic, multi-turn adversarial interactions. Conversely, high-fidelity human uplift studies are accurate but take too long to keep pace with modern AI deployment cycles. To bridge this gap, we introduce Biorisk-gym, an automated, dynamic framework that prototypes scalable "rule-in" evaluations. Our approach uses a three-agent architecture (an adversarial Gamemaster, a Target model under evaluation, and a Judge) to simulate multi-turn escalation across six stages of biological threat creation. We evaluated Claude Haiku 4.5, Claude Sonnet 4.6, and Llama 3.3 70B across 12 distinct attack scenarios. Results reveal stark, model-dependent vulnerabilities: Llama 3.3 70B demonstrated the highest mean peak biorisk uplift, whereas Sonnet 4.6 exhibited robust safety mechanisms. Crucially, nearly 40% of scenarios reached their highest threat scores on later conversation turns, demonstrating that single-turn evaluations systematically underestimate an AI's vulnerability to sustained adversarial pressure. We recommend further iterations of Biorisk-gym, with the aim of providing a pre-release methodology for identifying dangerous latent capabilities. Finally, because these automated evaluations generate highly sensitive attack protocols, we propose a secure, tiered deployment roadmap to responsibly mitigate the resulting infohazards.
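The multi-turn loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify an API, so `run_episode`, `EpisodeResult`, the callable signatures, the score scale, and the stub agents are all hypothetical, and the stubs are benign placeholders that make no model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures; the abstract does not define an interface.
Gamemaster = Callable[[List[str]], str]  # transcript -> next adversarial prompt
Target = Callable[[List[str]], str]      # transcript -> reply of the model under test
Judge = Callable[[str], float]           # reply -> score (scale is an assumption)


@dataclass
class EpisodeResult:
    scores: List[float] = field(default_factory=list)

    @property
    def peak_turn(self) -> int:
        # Index of the first turn that reached the maximum score.
        return self.scores.index(max(self.scores))


def run_episode(gamemaster: Gamemaster, target: Target, judge: Judge,
                max_turns: int = 5) -> EpisodeResult:
    """Alternate Gamemaster prompts and Target replies, scoring every reply."""
    transcript: List[str] = []
    result = EpisodeResult()
    for _ in range(max_turns):
        transcript.append(gamemaster(transcript))
        reply = target(transcript)
        transcript.append(reply)
        result.scores.append(judge(reply))
    return result


# Benign stand-ins so the sketch runs without any model access.
def stub_gamemaster(transcript: List[str]) -> str:
    return f"follow-up {len(transcript) // 2}"


def stub_target(transcript: List[str]) -> str:
    return "x" * len(transcript)  # reply simply grows each turn


def stub_judge(reply: str) -> float:
    return float(len(reply))


episode = run_episode(stub_gamemaster, stub_target, stub_judge, max_turns=3)
print(episode.scores, episode.peak_turn)
```

Recording a score for every turn, rather than only the final one, is what makes it possible to ask on which turn a scenario peaked.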
This project produces a framework for multi-turn model assessment, elegantly quantifying a known phenomenon by showing that a significant fraction of scenarios escalate beyond the initial response. The extensive limitations and future work sections demonstrate real thoughtfulness on the part of the authors and should be commended. I would like to congratulate the team on this piece of work. Well done!
Accurate scoring is the hardest part of this pipeline, and the three criteria used (engagement, specificity, accuracy) only partially capture how much uplift a response actually provides. Additional criteria such as detail, actionability, and potential for harm would better reflect real-world threat models. The reasoning behind the score formula could also be made clearer. For instance, how would the score of a full virus production protocol with no numerical parameters differ from that of one with all the details? Calibration against human scoring would help reveal where model judgments diverge.
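One simple way to approach the suggested calibration is to score the same responses with both the judge model and human raters, then look at the average gap and the individual items where they disagree most. The sketch below is an assumption-laden illustration: `calibration_report`, its `threshold` parameter, and the example numbers are hypothetical and are not taken from the submission.

```python
from statistics import mean


def calibration_report(judge_scores, human_scores, threshold=1.0):
    """Compare judge-model scores with human scores for the same responses."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "mean_abs_error": mean(abs(d) for d in diffs),
        # Positive bias means the judge scores higher than the human raters.
        "mean_bias": mean(diffs),
        # Indices of responses where the two disagree by more than the threshold.
        "divergent_items": [i for i, d in enumerate(diffs) if abs(d) > threshold],
    }


# Made-up numbers purely to exercise the function; not project results.
report = calibration_report([2.0, 4.0, 1.0, 5.0], [2.0, 2.0, 1.0, 4.0])
print(report)
```

The list of divergent items is the part most useful for a rubric revision, since it points to the specific responses worth reading side by side.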
As acknowledged in the limitations section, scoring is bounded by the judge model's capability ceiling, self-preference bias, and refusals. These are key challenges for the framework and are worth exploring further. The finding that biorisk assistance capabilities differ between models is expected, so characterising the qualitative differences would add significant value to the findings. While limited time and infohazardous content are acknowledged, additional details such as a product demo, a benign example, the distribution of scores, or high-level information on parameters like persona and scenario would strengthen the proof-of-concept.
The rule-in vs rule-out framing is exactly right, and the finding that nearly 40% of scenarios peaked on later turns rather than turn 0 is the most policy-relevant empirical result in this cohort. Single-turn evals systematically miss real risk, and you've demonstrated that with real models. The ceiling right now is judge reliability: Haiku 4.5 evaluating Haiku 4.5 conflates escalation with same-family self-agreement, and the expert-grounded rubric path you outline is the right approach.
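For reference, a late-peak statistic of this kind can be computed from per-turn score trajectories as in the sketch below. It assumes one list of judge scores per scenario with turn 0 first, and it assigns ties to the earliest turn; both choices are assumptions, since the submission's exact procedure is not described here. The function name and the example trajectories are hypothetical.

```python
def late_peak_fraction(trajectories):
    """Fraction of scenarios whose maximum score first occurs after turn 0."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    late = sum(1 for scores in trajectories if scores.index(max(scores)) > 0)
    return late / len(trajectories)


# Illustrative scores only; not data from the submission.
example = [
    [3.0, 2.0, 1.0],  # peaks at turn 0
    [1.0, 4.0, 2.0],  # peaks at turn 1
    [2.0, 2.0, 2.0],  # tie, counted as turn 0
    [0.0, 1.0, 5.0],  # peaks at turn 2
]
fraction = late_peak_fraction(example)
print(fraction)
```

Counting ties toward turn 0 makes the statistic conservative: a scenario is only reported as a late peak when a later turn strictly exceeds the first response.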
The authors present an important perspective on the need for rule-in evaluations that are more efficient than human uplift studies and that incorporate multi-turn dynamics. Although the three-agent architecture (gamemaster, target, and judge) follows canonical approaches from the literature, we appreciate the work. Two minor comments: the Virology Capabilities Test includes rubric-based grading of open-ended responses, and the limitations regarding compute credits are understood. More substantial improvements would include precise descriptions of the experimental design. For example, how were specific personas selected for testing? Why were Claude Haiku 4.5, Claude Sonnet 4.6, and Llama 3.3 70B selected for evaluation? How is language model engagement measured? The evaluation metrics also require revision: for instance, the sum of specificity and accuracy should be justified, or alternatives should be considered.
Cite this work
@misc{
  title={(HckPrj) Biorisk-gym: A new direction for “rule-in” AI model evaluations for biosecurity},
  author={Aditya Kasarla, Aanika Dalaal, Kailer Laino},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


