Apr 26, 2026
Semantic Naturalness Predicts Monitor Evasion in Biosecurity LLM Gatekeepers
Arielle Berthe
This work examines the extent to which a safety monitor for large language models (LLMs), configured for biosecurity, can be evaded using ordinary natural language, without adversarial prompting or technical expertise. We evaluate a monitor deployed via API across five bio-relevant topics under three controlled conditions: direct requests, single-turn contextualized queries, and multi-turn conversational sequences.
The results show that detection behavior is strongly associated with the semantic orientation of the evaluated request. Operationally framed queries are consistently flagged, whereas descriptively framed queries, particularly in academic contexts, frequently evade detection. In multi-turn settings, sequences composed exclusively of descriptive turns achieve systematic evasion within the evaluated environment, even when they produce technically detailed outputs in the full assistant setting.
Based on these findings, we introduce the concept of semantically mediated distributed extraction, in which no individual turn triggers detection, but the overall interaction produces sensitive information. We also document a divergence between API-based monitoring and full-model behavior, suggesting that architectural factors may influence robustness against context-based evasion strategies.
We propose incorporating semantic orientation, distributed extraction, and cross-environment divergence into evaluation frameworks for AI safety systems. These results highlight the relevance of semantic and interaction-level dynamics in the analysis of monitoring mechanisms.
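The notion of semantically mediated distributed extraction can be illustrated with a minimal sketch that contrasts per-turn monitoring with conversation-level monitoring. Everything here is an assumption made for illustration: the `Monitor` type, the helper names, and the toy monitor with its placeholder markers `A`, `B`, `C` are hypothetical and are not the monitor, prompts, or data evaluated in this work.

```python
from typing import Callable, List

# Hypothetical interface: a monitor maps text to True (flagged) or False (passed).
Monitor = Callable[[str], bool]


def per_turn_flags(turns: List[str], monitor: Monitor) -> List[bool]:
    """Apply the monitor to each turn in isolation."""
    return [monitor(turn) for turn in turns]


def conversation_flag(turns: List[str], monitor: Monitor) -> bool:
    """Apply the monitor once to the concatenated conversation."""
    return monitor("\n".join(turns))


def is_distributed_evasion(
    turns: List[str], monitor: Monitor, output_is_sensitive: bool
) -> bool:
    """True when no single turn is flagged but the overall output is judged sensitive.

    `output_is_sensitive` is assumed to come from a separate human or rubric-based
    judgment; it is not computed here.
    """
    return output_is_sensitive and not any(per_turn_flags(turns, monitor))


# Toy stand-in monitor (illustrative only): flags text that mentions
# two or more distinct placeholder markers.
MARKERS = {"A", "B", "C"}


def toy_monitor(text: str) -> bool:
    return len(MARKERS & set(text.split())) >= 2


turns = ["describe A", "describe B", "describe C"]
turn_results = per_turn_flags(turns, toy_monitor)
whole_result = conversation_flag(turns, toy_monitor)
evaded = is_distributed_evasion(turns, toy_monitor, output_is_sensitive=True)
```

In this toy setting each turn passes on its own, while the same monitor applied to the full conversation flags it. That is one way an evaluation framework might operationalize the gap between turn-level and interaction-level detection.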
Clean A/B/C design. The monitor under test is a custom-prompted binary classifier built on the Claude API, so I would suggest either reframing the claims about biosecurity LLM gatekeepers or adding a second monitor (e.g. Llama Guard). I also suggest strengthening the description of the browser evaluation by providing a structured rubric, even a coarse one.
Great job and interesting project choice!
Your writing is verbose and some sections are repeated. That said, most sections are necessary and are clear and easy to follow.
Your work also formally identified the effect of something many biosecurity researchers probably felt was important, paving the way for better safeguards down the line. Send this to biosecurity researchers you think might benefit from knowing about it!
The A/B/C framework is more disciplined than typical jailbreak demonstrations and the complete-separation result in the logistic regression is a cleaner empirical finding than most evasion papers offer, but the work's framing travels further than the evidence supports because the monitor evaluated is a strawman. A one-line system prompt asking for FLAG or PASS is not representative of production biosecurity monitoring, which uses constitutional classifiers, multi-stage pipelines, or fine-tuned detection models, and showing that a minimal binary classifier fails on descriptive academic framing tells the reader something about that specific configuration rather than about LLM monitoring generally. The architectural divergence observation compounds this, because the comparison is between the author's custom classifier and the full Claude assistant in the browser, which are not two monitoring architectures but a strawman classifier and a production system doing different things, so the inference that model sophistication moderates evasion is not actually supported by the comparison. The most valuable single revision would be evaluating the same framework against at least one more sophisticated monitor (the Sharma et al. constitutional classifier the author already cites is an obvious candidate), which would either generalize the finding or sharpen it into the architectural result the paper currently gestures at without testing.
Cite this work
@misc{
  title={(HckPrj) Semantic Naturalness Predicts Monitor Evasion in Biosecurity LLM Gatekeepers},
  author={Arielle Berthe},
  date={4/26/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


