Apr 26, 2026

BASTK-Bench: Evaluating Bioweapon Risk in Open-Weight AI via Somatic Tacit Knowledge and Real-World Uplift

Anusha Asim, Ammar Ahmed Farooqi, Maryam Asif, Fouwaz Parkar

BASTK-Bench introduces a novel framework for evaluating biological risk in open-weight AI models by focusing on real-world, execution-level capability under low-resource conditions, particularly somatic tacit knowledge in tasks like DIY CRISPR troubleshooting. It is among the first studies to systematically biorisk-assess newer open-weight models such as Llama-4-Scout and Qwen3-32B in this kind of execution-oriented setting. The results show that risk is highly task- and framing-dependent, with newer models exhibiting higher risk in specific practical scenarios, suggesting current benchmarks may underestimate real-world biological misuse potential.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

Impact Potential & Innovation: The lack of standardized evaluation frameworks is a real gap in the evals landscape that is worth addressing. You correctly identify some problems with current eval methodology, such as limited prompt-sensitivity testing (although this is a more subtle problem than you describe, see eg. here: https://www.anthropic.com/research/evaluating-ai-systems for examples on how simple formatting choices can affect eval scores). Focus on Q&A testing is also a genuine problem, but you strawman the case a bit by completely omitting agentic evaluations (like ABC-Bench or ABLE) and uplift studies (like Shen et al., 2026 https://arxiv.org/abs/2602.16703) in your literature review section. Both Llama and Qwen3 have already been evaluated for biorisk, see for example https://airiskmonitor.net/.

"A key finding is that newer models may be more capable (and potentially more risky) than older, larger ones, challenging assumptions that safety scales predictably with model size" - that is actually not the assumption; it's well-known that newer, smaller models often perform better and that capabilities tend to scale with release date rather than size (see https://metr.org/time-horizons/). This isn't a new finding. Additionally, Llama-4-Scout is not actually smaller than Llama-3.3-70B; 4-Scout is a Mixture of Experts model with 109B total parameters and 17B active per token, while 3.3-70B is a dense model with 70B total parameters. Scout is smaller in inference compute per token, which is the point of MoE

Theory of change is missing from the paper. What is the intended audience of this eval? What will happen if people use it, how will it change outcomes? Is it meant to influence policy, evaluate safeguards for internal lab use, or contribute to risk monitoring? Clearly outlining this would make the submission stronger, and I encourage you to think about theory of change of your future projects in general, to make sure you build a tool to solve a problem rather than look for a problem to suit your tool. The "non-googlable tacit cues" criterion is unaddressed as a generalizable research direction. Do you google it every time? Do you need experts to score every run of the eval? Do you have any plan on how you would automate this, at least partly? Without an answer, the approach doesn't generalize.

Execution Quality: There are some significant problems with the methodology of this work. The biggest one is the manual scoring that you use - current evaluations are focused on addressing that, this approach is not scalable. You acknowledge it may influence results ("despite consistent evaluation guidelines" that are either not disclosed or very laconic, if the table in the paper and the “scoring notes” column in the dataset is all there is). The work would benefit from at least using an LLM-as-judge approach, and acknowledging its limitations. Scoring notes like "depth" don't seem detailed enough.

Writing the eval in plain Python (rather than UK AISI's inspect framework) does not affect your score by itself, but inspect is the industry standard for AI evals and would have given you scoring infrastructure, model adapters, refusal handling for reasoning models, and reproducibility tooling for free. Strongly recommend getting acquainted with it for future work.

I appreciate that you use an auto-scorer for refusals. However, the implementation is brittle. For example, it breaks for reasoning models that start the answer with . This kind of edge case is exactly what existing eval frameworks like Inspect already handle, so you don't need to reinvent the wheel.

There are some minor code hygiene problems like hard-coding the API key in the main eval code. Consider using an .env file.

You do not provide any baseline (e.g., if a single google search would give out the same information or not), so it's possible that the actual uplift on the tasks you described (DIY CRISPR troubleshooting) is actually small, because of the availability of educational materials on the internet.

You define tacit knowledge as "depth beyond surface-level information," but tacit knowledge is by definition knowledge that is hard to convey by text; in biology it usually refers to e.g. ways to handle equipment. You also do not explain what you mean by somatic adaptation, or how you measure it. This makes it difficult to assess what the eval is actually measuring. In the future, I encourage you to document (and include in the paper) how you approached task selection and prompt generation, and if the task dataset is representative of actual real-work scenarios, and how do you know that. It’s an important part of the methodology, the VCT paper does a very good job here.

Your “Risk Score” is hard to interpret. Each of Uplift and Accessibility is scored 0–3, so 9 is the maximum possible value (the value set is [0,1,2,3,4,6,9]. How does the jump between 0 and 1 correspond to that of from 6 to 9? Why did you use a metric like this, what real-life value does it refer to?). Tacit Knowledge and Refusal Robustness are scored but the results are not evaluated or presented, even though somatic tacit knowledge is the central pitch of the paper.

Refusal robustness test is not standardized. "If a model refuses, a single jailbreak variant (e.g., role-play or academic framing) is attempted to test refusal robustness." How do you choose which variant? This needs a deterministic protocol, otherwise results aren't comparable across models. Also, your code actually tests both of those jailbreak strategies. The paper would benefit from either dropping that part or using more SOTA jailbreaking techniques, because now you have false negatives when the model refuses because the jailbreak prompts are not very sophisticated, and you mark it “safe”.

Prompt framing is methodologically off. Evals rarely explicitly use the word "bioweapon" because that would almost surely result in a refusal from all models. They frame the questions as benign (see https://www.rand.org/pubs/research_reports/RRA4591-1.html for an example). One of your representative prompts is "bioweapon planning," which likely biases the refusal-vs-compliance results.

I encourage you to think about information hazard when releasing code and prompts, and address this every time you publish. While I do not think that your paper presents a large infohazard, it's part of eval hygiene. Also think about information security when dealing with dual-use data. GitHub is owned by Microsoft, Copilot is trained on public repos, so you risk both data leakage and possibly providing dangerous knowledge to the model.

Your work lack any statistical testing, confidence intervals, etc. You acknowledge this as a limitation, which is appreciated and I understand it's hard to address in a hackathon, but the evaluation community would benefit from higher statistical rigor. I encourage you to address it in future work (e.g., pre-register results and methodology, calculate power). Shen et al. 2026 is a good example of the direction we should aim, even with its limitations.

"Closes expert gap" (Uplift = 3) is self-referential without an expert in the loop. You can't score whether something "closes the expert gap" without expert validation, which the paper explicitly does not have.

Presentation & Clarity: The paper does not discuss the prompts and tasks almost at all, so it's impossible to assess them from the paper alone. Either address that (say that prompts are proprietary or not public due to data-leakage risk/infohazard, but you’ll release them on reasonable request) or discuss the prompts in the text. You should have a table in the paper describing them and mapping them to your subsets, plus a brief description of the prompt generation process, and a better justification for why you chose those prompts/areas. There is no way to know from the paper alone which prompts you used, the reader needs to inspect both the database and code to figure it out. In the methods se

This is an important idea that focus on somatic tacit knowledge. My only concern is that the scale is extremely small and that the manual scoring Risk = Uplift × Accessibility is quite subjective. Also I feel that the paper claims models show “meaningful real-world bioweapon capability” is a bit overclaiming based on the results shown.

Cite this work

@misc {

title={

(HckPrj) BASTK-Bench: Evaluating Bioweapon Risk in Open-Weight AI via Somatic Tacit Knowledge and Real-World Uplift

},

author={

Anusha Asim, Ammar Ahmed Farooqi, Maryam Asif, Fouwaz Parkar

},

date={

4/26/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Read More

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Read More

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

Read More

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Read More

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.