Apr 27, 2026

OSINT BioGovernance Tool

Zhamilia Klycheva, Edidiong James, Erik Leklem, Shubhankar Dharmadhikari

This project proposes tools for faster identification, classification, and response to AI biosecurity “warning shots” (near-miss events that reveal catastrophic risk), so that those events can be translated into governance action by relevant policy, intelligence, and other actors. We use a formal definition of warning shots and associated criteria to analyze 21 global biosecurity events, finding that warning shots are typically recognized but rarely converted into binding governance. We propose that analysts and governance actors use a Governance Conversion Framework to improve future AI biosecurity warning shot responses. We develop an AI biosecurity event dashboard and a set of analytic tools to help biosecurity stakeholders monitor world events, categorize emergent biosecurity risks, and trigger faster response with relevant government and industry actors.

Reviewer's Comments


Innovation

There is a genuine gap here: no public-facing warning-shot dashboard exists specifically for AIxBio, and analysts and policymakers do need a structured way to triage emerging biosecurity signals against historical patterns. You correctly identify this gap and build something concrete to address it. The combination of a warning-shot classification + risk score + governance-conversion lens is a reasonable contribution package, and the headline insight (that the bottleneck is conversion, not detection) is a useful framing.

However, you do not engage with relevant prior literature. The Institute for Security and Technology released an AI Loss of Control Risk: Indications & Warning framework in February 2026 that uses the same intelligence-community I&W methodology you draw on, with a five-level severity scheme. It touches a different set of threat scenarios (AI loss of control, not bio), but it is the closest direct analog to what you're building and should be cited. More fundamentally, your Governance Conversion Framework is a domain-specific application of focusing-event theory in political science (Birkland's After Disaster, 1997, and 30+ years of follow-up work; Kingdon's multiple-streams framework). Your "Stage 3 stall" finding is a special case of what focusing-event scholars have been documenting for decades. Near-miss management literature in industrial safety, aviation, and mining is a third unacknowledged precedent (e.g., MSHA's quarterly near-miss reporting mandate is a working example of the governance infrastructure you say is missing). Adding a paragraph that situates GCF within these traditions would substantially strengthen the contribution. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2300441 for a starting point.

Execution

Your Risk Score is overcomplicated. Score = (S − 5) / 20 × 100 simplifies to (S − 5) × 5. Either justify the form or simplify. Relatedly, if the post-hoc step is to subtract 5, why not score each pillar 0–4 to begin with? Justify those choices, or don't introduce them.

You then describe the methodology as if the score were more granular than it actually is. With five integer pillars in [1, 5], the composite has at most 21 distinct possible values, all multiples of 5 after normalization. Your priority band "Low (0–29)" therefore contains only 6 reachable values, not a continuous range. Reporting the score as a 0–100 number implies a level of resolution that the underlying scale does not support. Justify the choice.
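The granularity argument can be checked by brute force. The sketch below enumerates every combination of five integer pillar scores in [1, 5] and applies the normalization quoted above; the function and variable names are illustrative and are not taken from the submission.

```python
from itertools import product

PILLARS = 5            # H, E, C, V, DR
LEVELS = range(1, 6)   # each pillar is an integer in [1, 5]


def normalized(s: int) -> float:
    """Normalization as quoted in the review: (S - 5) / 20 * 100."""
    return (s - 5) / 20 * 100


def simplified(s: int) -> int:
    """Algebraically equivalent form: (S - 5) * 5."""
    return (s - 5) * 5


# Every composite sum S that some combination of pillar scores can produce.
sums = {sum(combo) for combo in product(LEVELS, repeat=PILLARS)}

# Normalized scores that are actually reachable, and those in the "Low" band.
reachable = sorted(simplified(s) for s in sums)
low_band = [v for v in reachable if 0 <= v <= 29]

print(len(reachable), "reachable scores:", reachable)
print(len(low_band), "reachable scores in Low (0-29):", low_band)
```

The two forms agree for every reachable sum, which is why the longer expression adds nothing beyond the simpler one.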

A bigger issue with the scoring is that it is ordinal and you do not justify how the ordinal scale corresponds to actual quantitative risk. Adding ordinal categories is mathematically problematic. Expert validation is out of scope for a hackathon weekend, but the paper would benefit from at least acknowledging the measurement-theory limitations and discussing methods of addressing them in future work. See e.g. https://arxiv.org/pdf/2103.05440 for the standard reference.
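One way to see the problem with summing ordinal categories: any order-preserving relabeling of the levels is an equally valid ordinal scale, yet it can reverse the ranking of two composite scores. The sketch below uses two hypothetical events and a hypothetical relabeling; neither comes from the submission's dataset.

```python
def composite(scores, relabel=None):
    """Sum pillar scores, optionally after relabeling each ordinal level."""
    relabel = relabel or (lambda level: level)
    return sum(relabel(level) for level in scores)


# Hypothetical pillar scores (H, E, C, V, DR) for two events.
event_a = (5, 1, 1, 1, 1)   # one extreme pillar, the rest minimal
event_b = (2, 2, 2, 2, 2)   # uniformly low-moderate

# An order-preserving relabeling: 1 < 2 < 3 < 4 < 5 still holds,
# but the top level is treated as much further from level 4.
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}

raw_a, raw_b = composite(event_a), composite(event_b)
alt_a = composite(event_a, stretched.get)
alt_b = composite(event_b, stretched.get)

print("raw labels:       A =", raw_a, " B =", raw_b)   # B ranks higher
print("stretched labels: A =", alt_a, " B =", alt_b)   # A ranks higher
```

Because the ordinal scale itself gives no reason to prefer one labeling over the other, the ranking produced by the sum depends on an unstated interval assumption.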

The C1–C5 criteria are vague in places. What "epistemically accessible at the time" means in a coding rubric is not operationalized: what counts as accessible to whom, on what evidence base, judged at what time horizon? Standard practice for qualitative classification frameworks is to report inter-rater reliability (Cohen's κ or Krippendorff's α) across at least two independent coders. You don't do this, for understandable reasons given the time budget, but this should be explicitly listed as a limitation rather than left implicit.
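For reference, Cohen's κ for two coders is cheap to compute once both have independently coded the same cases. The sketch below is a minimal from-scratch implementation; the example codings are hypothetical and stand in for two coders judging whether a single criterion is met across the same set of cases.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each coder's
    marginal label frequencies.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must label the same non-empty set of items")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codings: does each case satisfy the criterion?
coder_1 = ["met", "met", "not met", "met", "not met", "met", "met", "not met"]
coder_2 = ["met", "not met", "not met", "met", "not met", "met", "met", "met"]

print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")
```

A κ near 0 indicates agreement no better than chance, so reporting it per criterion would show which of C1–C5 most need a tighter rubric.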

Your case selection methodology is not described in enough detail. The Data Collection appendix lists source types (peer-reviewed literature, institutional reports, investigative journalism, government records) but not the inclusion criteria, search strategy, time bounds, exclusion rules, or selection process. You acknowledge "potential selection bias" generically, but don’t describe how borderline cases were handled. Were you systematically searching or working from familiar examples? Without this, readers cannot assess whether the headline pattern reflects the world or your sampling. You should state explicitly whether the dataset is intended as exhaustive, illustrative, or convenience-sampled.

Your headline finding that most cases stall at Stage 3 rests on counts in a hand-curated sample of n=21 (and the GCF stages themselves were derived from n=13). There is no statistical treatment of this claim, no confidence interval, no comparison against a base rate, no engagement with the focusing-event literature where this same pattern has been studied at much larger scale.
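Even a simple interval estimate would show how much uncertainty a sample of 21 carries. The sketch below computes a Wilson score interval for a proportion; the count of stalled cases is a hypothetical placeholder, since the exact number is not restated here, and should be replaced with the figure from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


N_CASES = 21            # sample size stated in the review
STALLED_AT_STAGE_3 = 15  # HYPOTHETICAL count, for illustration only

low, high = wilson_interval(STALLED_AT_STAGE_3, N_CASES)
print(f"point estimate: {STALLED_AT_STAGE_3 / N_CASES:.2f}")
print(f"95% Wilson interval: [{low:.2f}, {high:.2f}]")
```

An interval only quantifies sampling noise; it does not correct for the selection issues raised above, so it complements rather than replaces a described sampling strategy.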

The conclusion states "as AIxBio risks continue to converge and accelerate, the cost of stalling at Stage 3 will only increase." Cost is never assessed anywhere in your framework; your score measures risk, not the expected cost of governance failure. Either add a cost-of-inaction component (or even a placeholder for one), or rephrase the conclusion to match what your tool actually measures.

You write that "Figure 3 illustrates the distribution of cases across time, tier classification, and risk level, showing increased clustering in recent years." Recent-years clustering is exactly what you'd expect from any non-exhaustive OSINT-curated dataset (recent events are better-documented and more salient to curators). If you want to claim a real temporal trend, you need either an explicit exhaustiveness argument or an analysis that controls for differential discovery rates across decades.

Presentation

The paper itself is clear, well-organized, and easy to follow. Section structure is logical, the three-stage pipeline (OSINT scanning → risk scoring → GCF assessment) is communicated cleanly in Figure 1, and the writing is at a level appropriate for a policy audience.

Figure 1 writes the composite formula as S = H · E · C · V · DR, which parses as multiplication, while Section 3 of the text gives S = H + E + C + V + DR. The two should be made consistent.

Figure 1 is also not very readable, especially as it is placed before the part of the paper that explains it.

Accessibility of the dashboard is poor. The low-luminance green-on-black text appears to fail WCAG AA contrast minimums in several places, and color carries a substantial semantic load (yellow = high, green = pass, red = critical, dim green = inactive) without redundant text encoding, which loses information for the ~8% of men with red-green colorblindness. See https://www.w3.org/WAI/WCAG21/Understanding/contrast-minimum for the standard. Adding a high-contrast mode and redundant text labels alongside color codes would address both issues.
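The contrast check can be automated with the WCAG 2.1 relative-luminance formula. The sketch below implements it; the dim-green hex value is a hypothetical stand-in, not a color sampled from the dashboard, and 4.5:1 is the AA minimum for normal-size text.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.1 definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1:1 (identical) to 21:1 (black on white)."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """AA requires 4.5:1 for normal text and 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)


# HYPOTHETICAL palette entry: a dim green on a black background.
DIM_GREEN, BLACK = "#006400", "#000000"
print(f"contrast ratio: {contrast_ratio(DIM_GREEN, BLACK):.2f}")
print("passes AA for normal text:", passes_aa(DIM_GREEN, BLACK))
```

Running every foreground/background pair in the dashboard's palette through a check like this would show exactly which combinations need to change.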

I will mostly focus on the dashboard and some of the classifications I looked at in detail. In principle, I believe warning shots can be a really powerful motivator for policy changes, and I do fear that AIxBio will only be taken fully seriously once an actual misuse event happens. That being said, I am a little doubtful of some of the scores, e.g. why various flu transitions and spillovers are high-risk AIxBio warning shots. I agree that those events are high risk, but they don't particularly demonstrate a vulnerability that was previously unknown, nor are they AIxBio related.

The website can be very nice as a resource for biosecurity researchers searching for more context and rough assessments of different historical and ongoing events, but I am a little doubtful that it can serve as a policy-oriented platform for warning shots.

Covers an important topic in AI x Bio risk, which is recognizing early warning signals and then acting upon them in a way that mitigates future risk. The dashboard works very well for what it sets out to do and is useful as a means of getting an overview of historical early warning. As an educational (or even eventually a research) tool it is a great addition. However, it is in the governance aspect where, as revealed by the project, the key deficits lie. I am not convinced overall that a dashboard, even one that logs governance failures in responding to early warning, gets at the crux of the problem. This is because I do not believe that the problem is one of insufficient awareness on the part of authorities, but rather insufficient prioritization and over-politicization of biosecurity risks. I do not see a dashboard, even a very attractive and user-friendly one, doing much to mitigate these obstacles. Maybe around the margins, because it allows a clear and concise story to be told, but I do not think this is going to do much to move the needle on actual government response to risk. Overall, a well-executed project with strong informational and educational value, but it is unlikely to directly contribute much to decreasing biosecurity risk.

Cite this work

@misc{
  title={(HckPrj) OSINT BioGovernance Tool},
  author={Zhamilia Klycheva, Edidiong James, Erik Leklem, Shubhankar Dharmadhikari},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.


Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.


Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.