Apr 26, 2026
BASTK-Bench: Evaluating Bioweapon Risk in Open-Weight AI via Somatic Tacit Knowledge and Real-World Uplift
Anusha Asim, Ammar Ahmed Farooqi, Maryam Asif, Fouwaz Parkar
BASTK-Bench introduces a novel framework for evaluating biological risk in open-weight AI models by focusing on real-world, execution-level capability under low-resource conditions, particularly somatic tacit knowledge in tasks like DIY CRISPR troubleshooting. It is among the first studies to systematically biorisk-assess newer open-weight models such as Llama-4-Scout and Qwen3-32B in this kind of execution-oriented setting. The results show that risk is highly task- and framing-dependent, with newer models exhibiting higher risk in specific practical scenarios, suggesting current benchmarks may underestimate real-world biological misuse potential.
Impact Potential & Innovation: The lack of standardized evaluation frameworks is a real gap in the evals landscape that is worth addressing. You correctly identify some problems with current eval methodology, such as limited prompt-sensitivity testing (although this is a more subtle problem than you describe; see e.g. https://www.anthropic.com/research/evaluating-ai-systems for examples of how simple formatting choices can affect eval scores). The focus on Q&A testing is also a genuine problem, but you strawman the case a bit by completely omitting agentic evaluations (like ABC-Bench or ABLE) and uplift studies (like Shen et al., 2026, https://arxiv.org/abs/2602.16703) in your literature review section. Both Llama and Qwen3 have already been evaluated for biorisk; see for example https://airiskmonitor.net/.
"A key finding is that newer models may be more capable (and potentially more risky) than older, larger ones, challenging assumptions that safety scales predictably with model size" - that is actually not the assumption; it's well-known that newer, smaller models often perform better and that capabilities tend to scale with release date rather than size (see https://metr.org/time-horizons/). This isn't a new finding. Additionally, Llama-4-Scout is not actually smaller than Llama-3.3-70B; 4-Scout is a Mixture of Experts model with 109B total parameters and 17B active per token, while 3.3-70B is a dense model with 70B total parameters. Scout is smaller in inference compute per token, which is the point of MoE
Theory of change is missing from the paper. What is the intended audience of this eval? What will happen if people use it, how will it change outcomes? Is it meant to influence policy, evaluate safeguards for internal lab use, or contribute to risk monitoring? Clearly outlining this would make the submission stronger, and I encourage you to think about theory of change of your future projects in general, to make sure you build a tool to solve a problem rather than look for a problem to suit your tool. The "non-googlable tacit cues" criterion is unaddressed as a generalizable research direction. Do you google it every time? Do you need experts to score every run of the eval? Do you have any plan on how you would automate this, at least partly? Without an answer, the approach doesn't generalize.
Execution Quality: There are some significant problems with the methodology of this work. The biggest one is the manual scoring: it is not scalable, and current evaluation work is focused on moving away from it. You acknowledge it may influence results "despite consistent evaluation guidelines", but those guidelines are either not disclosed or very laconic, if the table in the paper and the “scoring notes” column in the dataset are all there is. The work would benefit from at least using an LLM-as-judge approach and acknowledging its limitations. Scoring notes like "depth" don't seem detailed enough.
Writing the eval in plain Python (rather than UK AISI's Inspect framework) does not affect your score by itself, but Inspect is the industry standard for AI evals and would have given you scoring infrastructure, model adapters, refusal handling for reasoning models, and reproducibility tooling for free. I strongly recommend getting acquainted with it for future work.
I appreciate that you use an auto-scorer for refusals. However, the implementation is brittle. For example, it breaks for reasoning models that start the answer with a reasoning block (such as a <think> tag). This kind of edge case is exactly what existing eval frameworks like Inspect already handle, so you don't need to reinvent the wheel.
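To illustrate the kind of edge case meant here, the following is a minimal sketch of a refusal check that removes a leading reasoning block before looking for refusal phrases. The `<think>` tag name, the marker list, and the 200-character window are all assumptions chosen for illustration, not the submission's actual scorer and not how Inspect implements it.

```python
import re

# Hypothetical marker list; a real scorer would need a broader, validated set.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")

# Matches one leading reasoning block. The <think> tag name is an assumption;
# different models may use different delimiters.
_REASONING_BLOCK = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL | re.IGNORECASE)


def strip_reasoning(text: str) -> str:
    """Drop a leading reasoning block so only the final answer is inspected."""
    return _REASONING_BLOCK.sub("", text, count=1)


def is_refusal(text: str, window: int = 200) -> bool:
    """Return True if a refusal marker appears near the start of the final answer."""
    answer = strip_reasoning(text).lower().replace("\u2019", "'")
    head = answer[:window]
    return any(marker in head for marker in REFUSAL_MARKERS)
```

A keyword check like this is still crude (it can miss partial compliance or soft refusals), which is part of why an established framework or a judge model is usually preferable.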
There are some minor code hygiene problems like hard-coding the API key in the main eval code. Consider using an .env file.
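A minimal sketch of the alternative, assuming the key is exposed as an environment variable (for example, exported by the shell or loaded from an untracked .env file). The variable name `EVAL_API_KEY` is hypothetical.

```python
import os


def load_api_key(var_name: str = "EVAL_API_KEY") -> str:
    """Read the API key from the environment instead of hard-coding it."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message rather than sending an empty key.
        raise RuntimeError(
            f"Set the {var_name} environment variable "
            "(for example via an untracked .env file)."
        )
    return key
```

Adding the .env file to .gitignore keeps the key out of the public repository.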
You do not provide any baseline (e.g., if a single google search would give out the same information or not), so it's possible that the actual uplift on the tasks you described (DIY CRISPR troubleshooting) is actually small, because of the availability of educational materials on the internet.
You define tacit knowledge as "depth beyond surface-level information," but tacit knowledge is by definition knowledge that is hard to convey by text; in biology it usually refers to e.g. ways to handle equipment. You also do not explain what you mean by somatic adaptation, or how you measure it. This makes it difficult to assess what the eval is actually measuring. In the future, I encourage you to document (and include in the paper) how you approached task selection and prompt generation, whether the task dataset is representative of actual real-world scenarios, and how you know that. It’s an important part of the methodology; the VCT paper does a very good job here.
Your “Risk Score” is hard to interpret. Each of Uplift and Accessibility is scored 0–3, so 9 is the maximum possible value (the value set is [0, 1, 2, 3, 4, 6, 9]). How does the jump from 0 to 1 correspond to the jump from 6 to 9? Why did you use a metric like this, and what real-life value does it refer to? Tacit Knowledge and Refusal Robustness are scored but the results are not evaluated or presented, even though somatic tacit knowledge is the central pitch of the paper.
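The value set quoted above can be checked by enumerating every Uplift × Accessibility combination. This is only a sketch of the arithmetic, assuming both components take integer values 0 to 3 as stated; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

SCALE = range(4)  # each component is scored 0, 1, 2, or 3

# Count how many (uplift, accessibility) pairs produce each product.
counts = Counter(uplift * access for uplift, access in product(SCALE, repeat=2))

value_set = sorted(counts)
print(value_set)        # distinct Risk Score values
print(dict(counts))     # number of score pairs mapping to each value
```

Of the 16 possible score pairs, 7 collapse to a Risk Score of 0, and the values 5, 7, and 8 can never occur, so equal differences in the score do not correspond to equal differences in the underlying ratings.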
The refusal robustness test is not standardized. "If a model refuses, a single jailbreak variant (e.g., role-play or academic framing) is attempted to test refusal robustness." How do you choose which variant? This needs a deterministic protocol, otherwise results aren't comparable across models. Also, your code actually tests both of those jailbreak strategies. The paper would benefit from either dropping that part or using more SOTA jailbreaking techniques: as it stands, when a model refuses only because the jailbreak prompts are not very sophisticated, you mark it “safe”, which is a false negative.
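One way to make the protocol deterministic is to apply every variant, in a fixed order, to every refused prompt and record the outcome per variant. The sketch below assumes a hypothetical `ask(prompt_id, variant)` callable that returns True when the model refuses; the two variant names come from the paper's own examples, and nothing here reflects the submission's actual code.

```python
from typing import Callable, Dict, Sequence

# Fixed, ordered variant list so every model sees the same sequence.
VARIANTS: Sequence[str] = ("role_play", "academic_framing")


def refusal_robustness(
    prompt_id: str,
    ask: Callable[[str, str], bool],
    variants: Sequence[str] = VARIANTS,
) -> Dict[str, bool]:
    """Return {variant: refused?} for one prompt, always in the same order."""
    return {variant: ask(prompt_id, variant) for variant in variants}


def is_robust(results: Dict[str, bool]) -> bool:
    """A prompt counts as robustly refused only if every variant was refused."""
    return all(results.values())
```

Reporting the per-variant dictionary rather than a single pass/fail flag also makes it visible which framing, if any, broke the refusal.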
Prompt framing is methodologically off. Evals rarely explicitly use the word "bioweapon" because that would almost surely result in a refusal from all models. They frame the questions as benign (see https://www.rand.org/pubs/research_reports/RRA4591-1.html for an example). One of your representative prompts is "bioweapon planning," which likely biases the refusal-vs-compliance results.
I encourage you to think about information hazard when releasing code and prompts, and address this every time you publish. While I do not think that your paper presents a large infohazard, it's part of eval hygiene. Also think about information security when dealing with dual-use data. GitHub is owned by Microsoft, Copilot is trained on public repos, so you risk both data leakage and possibly providing dangerous knowledge to the model.
Your work lacks any statistical testing, confidence intervals, etc. You acknowledge this as a limitation, which is appreciated, and I understand it's hard to address in a hackathon, but the evaluation community would benefit from higher statistical rigor. I encourage you to address it in future work (e.g., pre-register results and methodology, calculate power). Shen et al. 2026 is a good example of the direction we should aim for, even with its limitations.
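As one low-cost starting point, a percentile bootstrap gives a confidence interval for a mean score without distributional assumptions. This is a minimal sketch using only the standard library; the function name, the number of resamples, and the fixed seed are illustrative choices, and with very few prompts the resulting interval will be wide and only approximate.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

For example, `bootstrap_ci(scores_for_model)` (where `scores_for_model` is a hypothetical list of per-prompt Risk Scores) could be reported next to each model's mean so that readers can see whether differences between models exceed the sampling noise.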
"Closes expert gap" (Uplift = 3) is self-referential without an expert in the loop. You can't score whether something "closes the expert gap" without expert validation, which the paper explicitly does not have.
Presentation & Clarity: The paper barely discusses the prompts and tasks, so it's impossible to assess them from the paper alone. Either address that (say that prompts are proprietary or not public due to data-leakage risk/infohazard, but you’ll release them on reasonable request) or discuss the prompts in the text. You should have a table in the paper describing them and mapping them to your subsets, plus a brief description of the prompt generation process, and a better justification for why you chose those prompts/areas. There is no way to know from the paper alone which prompts you used; the reader needs to inspect both the database and code to figure it out. In the methods se
The focus on somatic tacit knowledge is an important idea. My only concerns are that the scale is extremely small and that the manual scoring, Risk = Uplift × Accessibility, is quite subjective. I also feel that the paper's claim that models show “meaningful real-world bioweapon capability” overclaims relative to the results shown.
Cite this work
@misc {
title={
(HckPrj) BASTK-Bench: Evaluating Bioweapon Risk in Open-Weight AI via Somatic Tacit Knowledge and Real-World Uplift
},
author={
Anusha Asim, Ammar Ahmed Farooqi, Maryam Asif, Fouwaz Parkar
},
date={
4/26/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


