Apr 26, 2026
Success Is Not Safety: When Helpful Bio Agents Do Too Much
Philip Nilsson
This project introduces Bio Capability Boundary Monitor, a prototype pre-deployment audit tool for biological AI workflows. It detects capability overreach: cases where an agent completes the visible task but uses more biological capability than the request justified.
In a 1,500-run Llama evaluation, raw task success was 96.6%, while safety-adjusted success was only 31.7%. A scope-aware postcondition layer reduced the strict false-allow rate from 8.47% to 0.139%, recovering 60 of 61 missed unsafe runs while adding 0 new false positives across 480 runs. A 100-run Track 3 application slice showed the same pattern in bounded public-health triage and screening policy review proxies.
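The two false-allow percentages can be sanity-checked with simple arithmetic. The sketch below assumes a denominator of 720 unsafe runs; that figure is not stated in the abstract and is inferred only because 61/720 and 1/720 reproduce the reported 8.47% and 0.139%. The function name is likewise illustrative.

```python
# ASSUMPTION: 720 unsafe runs is inferred, not reported in the abstract.
# It is the denominator for which 61 misses give 8.47% and 1 miss gives 0.139%.
ASSUMED_UNSAFE_RUNS = 720

missed_before = 61                        # unsafe runs allowed before the postcondition layer
recovered = 60                            # of those, runs the layer caught
missed_after = missed_before - recovered  # 1 unsafe run still allowed


def false_allow_rate(missed: int, unsafe_total: int) -> float:
    """Percentage of unsafe runs that were incorrectly allowed."""
    return 100.0 * missed / unsafe_total


before = false_allow_rate(missed_before, ASSUMED_UNSAFE_RUNS)
after = false_allow_rate(missed_after, ASSUMED_UNSAFE_RUNS)
print(f"{before:.2f}% -> {after:.3f}%")  # prints "8.47% -> 0.139%"
```

If the true denominator differs, the same function applies; only the assumed constant changes.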
The core lesson is that biosecurity audits should not only ask whether an answer is harmful or useful. They should ask whether the workflow used biological capability the task actually justified.
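As a rough illustration of that question, a scope check can be modeled as a set difference between the capabilities a task justified and those the workflow actually used. This is a minimal sketch, not the project's implementation: the `RunRecord` type, the capability labels, and the definition of safety-adjusted success as "succeeded with no overreach" are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One agent run (hypothetical schema, not taken from the project)."""
    task_succeeded: bool
    justified: frozenset  # capabilities the request justified
    used: frozenset       # capabilities the workflow actually exercised


def overreach(run: RunRecord) -> frozenset:
    """Capabilities used beyond what the task justified."""
    return run.used - run.justified


def raw_success(runs: list) -> float:
    return sum(r.task_succeeded for r in runs) / len(runs)


def safety_adjusted_success(runs: list) -> float:
    # ASSUMED definition: a run counts only if it succeeded AND stayed in scope.
    return sum(r.task_succeeded and not overreach(r) for r in runs) / len(runs)


# Illustrative capability labels, invented for this example.
SUMMARY = "literature_summary"
LOOKUP = "dataset_lookup"
HANDLING = "handling_guidance"

runs = [
    RunRecord(True, frozenset({SUMMARY}), frozenset({SUMMARY})),            # in scope
    RunRecord(True, frozenset({SUMMARY}), frozenset({SUMMARY, HANDLING})),  # overreach
    RunRecord(False, frozenset({LOOKUP}), frozenset({LOOKUP})),             # task failed
]

print(raw_success(runs), safety_adjusted_success(runs))
```

In this toy data two of three runs complete the task, but only one does so without exceeding its justified scope, which mirrors the gap between raw and safety-adjusted success described above.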
This is important research that addresses capability overreach and is somewhat novel in biosecurity. If I understand correctly, the current system seems to rely heavily on hand-crafted rules, which limits generalization to real-world agent settings. I would focus on replacing manual scope rules with learned or formally specified constraints, and on evaluating in more realistic multi-step agent environments where capability boundaries are ambiguous.
The biggest challenge with the submission is that I do not see how using more biological capability than a task requires is a safety concern, though I accept that it is a challenge in terms of efficiency and compute costs. If a task uses a more advanced capability or shares more knowledge than necessary, but that knowledge or capability is not itself harmful, then I do not see what the problem is. The authors provide an example of an agent providing more biological context and handling information than necessary; however, it seems like the person employing the agent could simply make follow-up inquiries to get that same context and handling information. If that information isn't harmful, who cares?
Thank you for your submission. Evaluating biological agents by assessing whether capability overreach occurred is an important topic to address. The community would benefit from further analysis of pre-deployment strategies and from incentivizing their application. For the oversight layer, I wonder if supervision must be provided by a more capable model in the scenarios tested, or whether weak-to-strong generalization could apply. There are several ways that this submission could be improved. First, I think the core argument requires more justification. At present, features such as retrieval, planning, and reasoning often improve output quality and could even enhance safety. Second, how variable are your results across different language models, and why was Llama tested in particular? Third, a clearer explanation of key metrics such as safety-adjusted success would be appreciated, as they are important for interpreting the results.
Cite this work
@misc{
  title={(HckPrj) Success Is Not Safety: When Helpful Bio Agents Do Too Much},
  author={Philip Nilsson},
  date={4/26/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


