Apr 27, 2026
OSINT BioGovernance Tool
Zhamilia Klycheva, Edidiong James, Erik Leklem, Shubhankar Dharmadhikari
This project proposes tools for faster identification, classification, and response to AI biosecurity “warning shots” (near-miss events that reveal catastrophic risk), so as to translate those events into governance action by relevant policy, intelligence, and other actors. We use a formal definition of warning shots and associated criteria to analyze 21 global biosecurity events, finding that warning shots are typically recognized but rarely converted into binding governance. We propose that analysts and governance actors use a Governance Conversion Framework to improve future AI biosecurity warning shot responses. We develop an AI biosecurity event dashboard and a set of analytic tools to help biosecurity stakeholders monitor world events, categorize emergent biosecurity risks, and trigger faster response with relevant government and industry actors.
Innovation
There is a genuine gap here: no public-facing warning-shot dashboard exists specifically for AIxBio, and analysts and policymakers do need a structured way to triage emerging biosecurity signals against historical patterns. You correctly identify this gap and build something concrete to address it. The combination of a warning-shot classification + risk score + governance-conversion lens is a reasonable contribution package, and the headline insight (that the bottleneck is conversion, not detection) is a useful framing.
However, you do not engage with relevant prior literature. The Institute for Security and Technology released an AI Loss of Control Risk: Indications & Warning framework in February 2026 that uses the same intelligence-community I&W methodology you draw on, with a five-level severity scheme. It touches a different set of threat scenarios (AI loss of control, not bio), but it is the closest direct analog to what you're building and should be cited. More fundamentally, your Governance Conversion Framework is a domain-specific application of focusing-event theory in political science (Birkland's After Disaster, 1997, and 30+ years of follow-up work; Kingdon's multiple-streams framework). Your "Stage 3 stall" finding is a special case of what focusing-event scholars have been documenting for decades. Near-miss management literature in industrial safety, aviation, and mining is a third unacknowledged precedent (e.g., MSHA's quarterly near-miss reporting mandate is a working example of the governance infrastructure you say is missing). Adding a paragraph that situates GCF within these traditions would substantially strengthen the contribution. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2300441 for a starting point.
Execution
Your Risk Score is overcomplicated. Score = (S − 5) / 20 × 100 simplifies to (S − 5) × 5. Either justify the form or simplify. Relatedly, if the post-hoc step is to subtract 5, why not score each pillar 0–4 to begin with? Justify those choices, or don't introduce them.
You then describe the methodology as if the score were more granular than it actually is. With five integer pillars in [1, 5], the composite has at most 21 distinct possible values, all multiples of 5 after normalization. Your priority band "Low (0–29)" therefore contains only 6 reachable values (0, 5, 10, 15, 20, 25), not a continuous range. Reporting the score as a 0–100 number implies a level of resolution that the underlying scale does not support. Justify the choice.
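The arithmetic behind this point can be checked with a minimal sketch. It assumes only what the review quotes: five pillars, each an integer from 1 to 5, summed into S and normalized as (S − 5) / 20 × 100. The function and variable names are illustrative and do not come from the submission.

```python
from fractions import Fraction
from itertools import product

PILLARS = 5            # H, E, C, V, DR
LEVELS = range(1, 6)   # each pillar is an integer score 1..5


def normalized(s: int) -> Fraction:
    """Normalization as quoted in the review, kept exact with Fraction."""
    return Fraction(s - 5, 20) * 100


# Every composite sum S that some combination of pillar scores can produce.
sums = {sum(combo) for combo in product(LEVELS, repeat=PILLARS)}
scores = sorted(normalized(s) for s in sums)

# The quoted formula is the same thing as (S - 5) * 5.
assert all(normalized(s) == (s - 5) * 5 for s in sums)

# Only 21 values are reachable, and all are multiples of 5.
assert len(scores) == 21
assert all(score % 5 == 0 for score in scores)

# Reachable values inside the "Low (0-29)" band.
low_band = [int(score) for score in scores if 0 <= score <= 29]
print(low_band)  # [0, 5, 10, 15, 20, 25]
```

Scoring each pillar 0–4 from the start would give the same 21 values as 5 × S without the subtraction step.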
A bigger issue with the scoring is that it is ordinal, and you do not justify how the ordinal scale corresponds to actual quantitative risk. Summing ordinal categories is mathematically problematic, because nothing guarantees that the steps between levels are equal within a pillar or comparable across pillars. Expert validation is out of scope for a hackathon weekend, but the paper would benefit from at least acknowledging the measurement-theory limitations and discussing methods of addressing them in future work. See e.g. https://arxiv.org/pdf/2103.05440 for the standard reference.
The C1–C5 criteria are vague in places. What "epistemically accessible at the time" means in a coding rubric is not operationalized: accessible to whom, on what evidence base, and judged at what time horizon? Standard practice for qualitative classification frameworks is to report inter-rater reliability (Cohen's κ or Krippendorff's α) across at least two independent coders. You don't do this, for understandable reasons given the time budget, but it should be explicitly listed as a limitation rather than left implicit.
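For reference, Cohen's κ for two coders is cheap to compute once a second person has coded the same cases. The sketch below is a plain-Python version of the standard formula. The labels and the two coding vectors are invented for illustration and are not drawn from the submission's dataset.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return float("nan")  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codings of one criterion across ten cases (illustrative only).
coder_a = ["met", "met", "met", "not", "not", "met", "not", "met", "met", "not"]
coder_b = ["met", "met", "not", "not", "not", "met", "not", "met", "met", "met"]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))  # 0.58
```

In this invented example the coders agree on 8 of 10 cases, but κ is only about 0.58 once chance agreement is removed, which is why raw percent agreement alone is not a sufficient report.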
Your case selection methodology is not described in enough detail. The Data Collection appendix lists source types (peer-reviewed literature, institutional reports, investigative journalism, government records) but not the inclusion criteria, search strategy, time bounds, exclusion rules, or selection process. You acknowledge "potential selection bias" generically, but don’t describe how borderline cases were handled. Were you systematically searching or working from familiar examples? Without this, readers cannot assess whether the headline pattern reflects the world or your sampling. You should state explicitly whether the dataset is intended as exhaustive, illustrative, or convenience-sampled.
Your headline finding that most cases stall at Stage 3 rests on counts in a hand-curated sample of n=21 (and the GCF stages themselves were derived from n=13). There is no statistical treatment of this claim, no confidence interval, no comparison against a base rate, no engagement with the focusing-event literature where this same pattern has been studied at much larger scale.
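Even a simple interval estimate would make the uncertainty visible. The sketch below computes a Wilson score interval for a proportion. The count of 14 stalled cases is a placeholder chosen for illustration, since the review does not restate the actual count. Only n = 21 comes from the paper.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


N_CASES = 21     # sample size reported in the paper
N_STALLED = 14   # placeholder count, NOT taken from the paper

low, high = wilson_interval(N_STALLED, N_CASES)
print(f"{N_STALLED}/{N_CASES} stalled; 95% interval: {low:.2f} to {high:.2f}")
```

With these placeholder numbers the interval spans more than 30 percentage points. That is the kind of caveat the headline claim should carry, and it still says nothing about selection effects in how the 21 cases were chosen.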
The conclusion states "as AIxBio risks continue to converge and accelerate, the cost of stalling at Stage 3 will only increase." Cost is never assessed anywhere in your framework; your score measures risk, not the expected cost of governance failure. Either add a cost-of-inaction component (even a placeholder for one), or rephrase the conclusion to match what your tool actually measures.
You write that "Figure 3 illustrates the distribution of cases across time, tier classification, and risk level, showing increased clustering in recent years." Recent-years clustering is exactly what you'd expect from any non-exhaustive OSINT-curated dataset (recent events are better-documented and more salient to curators). If you want to claim a real temporal trend, you need either an explicit exhaustiveness argument or an analysis that controls for differential discovery rates across decades.
Presentation
The paper itself is clear, well-organized, and easy to follow. Section structure is logical, the three-stage pipeline (OSINT scanning → risk scoring → GCF assessment) is communicated cleanly in Figure 1, and the writing is at a level appropriate for a policy audience.
Figure 1 writes the composite formula as S = H · E · C · V · DR, which parses as multiplication, while Section 3 of the text gives S = H + E + C + V + DR. The two need to be reconciled.
Figure 1 is also hard to read, especially because it is placed before the part of the paper that explains it.
Accessibility of the dashboard is poor. The low-luminance green-on-black text appears to fail WCAG AA contrast minimums in several places, and color carries a substantial semantic load (yellow = high, green = pass, red = critical, dim green = inactive) without redundant text encoding, which loses information for the ~8% of men with red-green colorblindness. See https://www.w3.org/WAI/WCAG21/Understanding/contrast-minimum for the standard. Adding a high-contrast mode and redundant text labels alongside color codes would address both issues.
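The contrast check is easy to automate. The sketch below follows the relative-luminance and contrast-ratio definitions given in WCAG 2.1. The two green values are hypothetical stand-ins, not colors measured from the dashboard, so the real palette should be substituted before drawing conclusions.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.1 definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background) -> float:
    """WCAG contrast ratio, from 1:1 (identical) to 21:1 (black on white)."""
    l1, l2 = relative_luminance(foreground), relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # WCAG AA minimum for normal-size text

BLACK = (0, 0, 0)
DIM_GREEN = (0, 80, 0)      # hypothetical "inactive" color, not measured
BRIGHT_GREEN = (0, 255, 0)  # hypothetical "pass" color, not measured

for name, color in [("dim green", DIM_GREEN), ("bright green", BRIGHT_GREEN)]:
    ratio = contrast_ratio(color, BLACK)
    verdict = "passes" if ratio >= AA_NORMAL_TEXT else "fails"
    print(f"{name} on black: {ratio:.2f}:1 ({verdict} AA for normal text)")
```

A passing contrast ratio does not fix the color-only encoding problem; text labels or icons alongside the colors are still needed for that.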
I will mostly focus on the dashboard and some of the classifications I looked at in detail. In principle, I believe warning shots can be a really powerful motivator for policy changes, and I do fear that AIxBio will only be taken fully seriously once an actual misuse event happens. That being said, I am a little doubtful of some of the scores; for example, it is unclear why various flu transitions and spillovers are scored as high-risk AIxBio warning shots. I agree that those events are high risk, but they don't particularly demonstrate a vulnerability that was previously unknown, nor are they AIxBio related.
The website can be very nice as a resource for biosecurity researchers searching for more context and rough assessments of different historical and ongoing events, but I am a little doubtful that it can serve as a policy-oriented platform for warning shots.
The project covers an important topic in AI x Bio risk: recognizing early warning signals and then acting on them in a way that mitigates future risk. The dashboard works very well for what it sets out to do and is useful as a means of getting an overview of historical early warning. As an educational (or even eventually a research) tool it is a great addition. However, as the project itself reveals, the key deficits lie in the governance aspect. I am not convinced overall that a dashboard, even one that logs governance failures in responding to early warning, gets at the crux of the problem. This is because I do not believe that the problem is one of insufficient awareness on the part of authorities, but rather insufficient prioritization and over-politicization of biosecurity risks. I do not see a dashboard, even a very attractive and user-friendly one, doing much to mitigate these obstacles. It may help around the margins, because it allows a clear and concise story to be told, but I do not think this is going to do much to move the needle on actual government response to risk. Overall, a well-executed project with strong informational and educational value, but it is unlikely to directly contribute much to decreasing biosecurity risk.
Cite this work
@misc {
  title={(HckPrj) OSINT BioGovernance Tool},
  author={Zhamilia Klycheva, Edidiong James, Erik Leklem, Shubhankar Dharmadhikari},
  date={4/27/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


