Sep 15, 2025
Arbiter: Automated Review of Bio-AI Tools for Emerging Risk
Ryan Teo, Shrestha Rath
The rapid advancement of AI-enabled biological tools presents significant biosecurity and AI safety challenges, necessitating scalable and balanced assessments. Building on the Global Risk Index (GRI) for AI-enabled Biological Tools report's foundational framework, this work introduces Arbiter, an automated pipeline designed to overcome the GRI's resource-intensive manual analysis. Arbiter employs a multi-stage LLM-driven process to analyze scientific literature, systematically evaluating AI-Bio tools for misuse risks and, notably, their potential benefits. This includes assessing economic impact, national competitiveness, and crisis response capabilities, providing policymakers with the comprehensive, balanced insights required for informed decision-making. A pilot execution demonstrated Arbiter's ability to efficiently monitor emerging tools, highlight areas for prompt and model refinement, and support the development of future decision-support frameworks. Arbiter's modular and extensible design empowers users to tailor analyses, ensuring continuous, scalable, and adaptable oversight in this dynamic field.
This project is a very well-conceived prototype. Translating the Global Risk Index into an automated system is brilliant; the GRI is already being used as the cornerstone of AI-Bio risk assessment.
The documentation could use more usage examples, and the flow diagram from the paper could also go into the documentation. Since the paper notes the risk of amplifying dual-use details, the repo could carry the same disclaimer.
Excellent idea to expand the suggestions in the GRI report, and Arbiter is a promising prototype: modular, flexible, well-documented, and already working as an MVP. It’s also interesting that the project considers not only risks but also the potential benefits of bio-AI tools.
The main weakness is that the quality of the AI evaluations is uncertain without comparison to expert judgments, which could limit how much trust one can put in the results. A good and obvious next step would be to run calibration studies comparing Arbiter’s outputs with expert reviews, on a wider range of bio-AI tools.
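A calibration study of this kind could start from a simple agreement statistic. The sketch below is a minimal illustration, assuming Arbiter's outputs and the expert reviews can both be reduced to one categorical risk label per tool. The function name, the label set, and the sample ratings are hypothetical and are not taken from Arbiter or the GRI report.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)

    # Fraction of tools on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for six tools (illustrative only, not real results).
arbiter_labels = ["high", "low", "medium", "low", "high", "low"]
expert_labels = ["high", "low", "low", "low", "medium", "low"]

kappa = cohens_kappa(arbiter_labels, expert_labels)
print(f"Cohen's kappa between Arbiter and experts: {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 0 would suggest Arbiter's labels carry little information beyond label frequencies, while a value near 1 would suggest close alignment with the experts. If the ratings are ordinal, a weighted kappa or a rank correlation may be a better fit.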
Regarding minimal code replication, I managed to run Arbiter and the tests without issue.
Key strengths of the approach
There were strong visual aids (a clear flowchart and table) that guide the reader. The Europe PMC database example was effective and helped show how the tool could be used. I liked that it built on existing literature (the GRI paper) by building the tool from the appendix. There was also a clear explanation of the stages and objectives of the tool. Some key gaps were identified, acknowledging both risks and transformative benefits. The need for this tool was very clear (to keep up with the pace of the field). I also liked that the appendix mentions information hazards and implementation feasibility, along with the availability of full-text research papers.
Specific areas for improvement
Definition of “risk” unclear (maybe this was in the GRI paper?)
Unclear intended audience (researchers or policymakers). The tool felt a little bit reactive (which by itself is ok), since it depends on someone running a query.
Policymaker use-case unrealistic without keyword expertise. Also, unsure how this connects or feeds into governance.
Not sure about tool’s ability to remain up-to-date (maybe this is just through scanning or API calls?)
Another limitation could be that the analysis is limited to public tools, not dark web or closed communities.
Suggestions for how to develop this into a stronger project
Clarify risk definition (draw distinction from GRI paper if needed)
Specify target users and tailor tool accordingly.
Discuss handling of non-public datasets or acknowledge limitation explicitly
Potential next steps or future directions
Explore integration into policymaking frameworks (e.g., EU AI Act) or EU Market Surveillance directives.
Frame tool as an “early warning system” with policy applications
The submission presents an LLM crawler that analyzes scientific literature for biorisk according to the GRI framework.
While the project sketches out the technical architecture, it would be strengthened by demonstrating practical utility and contrasting it with alternative approaches.
Cite this work
@misc {
title={(HckPrj) Arbiter: Automated Review of Bio-AI Tools for Emerging Risk},
author={Ryan Teo and Shrestha Rath},
date={9/15/25},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


