Mar 23, 2026

Taxonomy without complementarity: a necessary precondition for routing monitor ensembles

Robert Amanfu

AI safety monitors can be combined by routing, that is, picking which monitor to trust based on the type of threat. But routing only helps if the monitors are complementary, meaning each one catches attacks the other misses. We test this on 130 AgentDojo episodes using action trace and chain-of-thought (CoT) monitors. Even with a perfect router that knows the true attack type, routing (AUROC=0.984) cannot beat CoT alone (0.998), because CoT already dominates on every category of attack. Our subtlety taxonomy is statistically valid (χ²=16.25, p=0.0003), but it reflects CoT winning everywhere rather than genuine complementarity. Adaptive attacks did not degrade CoT, though this likely reflects benchmark saturation rather than true robustness. We propose a three-step diagnostic that uses only existing scores, with no new model calls, to check whether routing can help before any infrastructure is built.
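The abstract does not spell out the three steps of the diagnostic, so the following is only a minimal sketch of what such a check could look like, not the author's implementation. It assumes each episode already carries a binary attack label, a category, and one score per monitor on a comparable scale. The function names, the dictionary layout, and the toy episodes are all hypothetical and invented for illustration; none of the numbers come from the paper.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Chance that a random attack outscores a random benign episode (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one attack and one benign episode")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def routing_diagnostic(episodes, monitors):
    """Hypothetical three-step check that reuses existing monitor scores only."""
    labels = [e["label"] for e in episodes]

    # Step 1: overall AUROC of each monitor used alone.
    overall = {m: auroc([e["scores"][m] for e in episodes], labels) for m in monitors}
    # On per-category ties, prefer the monitor that is best overall.
    ranked = sorted(monitors, key=lambda m: overall[m], reverse=True)

    # Step 2: find the winning monitor inside each category.
    by_category = defaultdict(list)
    for e in episodes:
        by_category[e["category"]].append(e)
    winners = {}
    for category, eps in by_category.items():
        cat_labels = [e["label"] for e in eps]
        per_monitor = {m: auroc([e["scores"][m] for e in eps], cat_labels) for m in monitors}
        winners[category] = max(ranked, key=lambda m: per_monitor[m])
    complementary = len(set(winners.values())) > 1

    # Step 3: oracle routing. Each episode is scored by its category's winner.
    # Winners are picked on the same data, so this is an optimistic upper bound.
    routed = [e["scores"][winners[e["category"]]] for e in episodes]
    oracle = auroc(routed, labels)

    return {
        "overall": overall,
        "winners": winners,
        "complementary": complementary,
        "oracle_auroc": oracle,
        "routing_can_help": oracle > max(overall.values()),
    }


# Toy data, invented for illustration: the CoT monitor wins in both categories.
toy_episodes = [
    {"category": "blatant", "label": 1, "scores": {"action": 0.9, "cot": 0.95}},
    {"category": "blatant", "label": 0, "scores": {"action": 0.2, "cot": 0.10}},
    {"category": "subtle", "label": 1, "scores": {"action": 0.3, "cot": 0.80}},
    {"category": "subtle", "label": 0, "scores": {"action": 0.4, "cot": 0.20}},
]
result = routing_diagnostic(toy_episodes, ["action", "cot"])
print(result)
```

In this toy case one monitor wins every category, so the oracle router simply reproduces that monitor and cannot improve on it. That is the pattern the abstract reports for CoT on AgentDojo.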

Reviewer's Comments

Really like the clarity and honesty of this write-up! The results are negative but clear and seem well-executed. The checklist before deploying routers is practically useful. And the implications, including on benchmark saturation, at the end of the write-up are well explained and worth reading. Might be good to write this up for a larger audience! On the research side, I’d love to see this analysis repeated on a harder benchmark where CoT monitoring doesn’t saturate and ideally doesn’t universally dominate.

Here is a list of my thoughts as I read through the paper.

1. After just reading the abstract, I feel like I understand the problem you are trying to solve, but I don't get where your work fixes that. Getting this right is important for other scientists to be convinced to read more of your paper.

2. I really like your first two intro paragraphs. The clarity and conciseness are really good. I want the rest of the paper to be as clear as this.

3. I would consider explaining your setup later in your methods section. While good, I only really care about it if I try to replicate the paper.

4. You should move the Subtlety Taxonomy to the intro, so you don't frame a research question about something the reader doesn't yet understand. I then want a couple of lines explaining why I should care about this research question, instead of just assuming the reader thinks it's good.

5. I feel like you should rewrite your results section for clarity. I want a computer science student to be able to understand it.

6. For your 3 recommendations, can you explain why you think these are important? I find your writing about monitor ensembles to be really interesting, but I personally don't care about how subtle an attack is to a monitor.

I really like your thinking about ensembles of monitors! This was really interesting to read through. I'm curious what Redwood staff think of the paper. If you had shown me the paper without knowledge of the hackathon, I would be really surprised to hear that you made it in three days.

I feel like the paper's ideas are really good, but your clarity gets in the way. I want to review the research ideas in more detail, but I find myself having to go back and forth to get good answers. I want to be able to skim the paper, see and understand your 3 research questions. Then go to the results/conclusion and see 3 clear answers to those questions. Then, if I'm interested, go through the technical work you did in your method. Often getting the clarity right takes less time than programming during the hackathon, so this may be a time allocation thing.

Maybe next time it would be tactical to see which research questions are interesting to, say, Tyler during office hours, and then go into more depth on just one.

- this was a real pleasure to read, thorough and seemingly airtight methodology and logic in defining complementarity clearly and as a precondition for routing, then testing this explicitly.

- the fact that AgentDojo attacks leave fingerprints in the reasoning traces that can be picked up by CoT is interesting. looking directly at those traces to understand what was flagged would have been useful, though the appendix does a bit of this already. perhaps CoT monitoring does not apply in subversive real-world environments, where the fingerprints would be less visible in the reasoning trace. would be curious as to further research in this direction.

- whilst the results seem reasonably hedged, the outcome is limited to this single benchmark / monitor pair and attack distribution. looking at reasoning traces from other attacks to better understand the ground truth of what's being monitored might be valuable here as well, and/or trying with further model pairs and benchmarks. in this same vein, it would be good to have human validation of the subtlety taxonomy - LLM-labels all the way down, so not sure exactly what we're measuring here

- excellent work, impressive for a single author submission, right size and clear results, if not super striking given the setup.

Cite this work

@misc{
  title={(HckPrj) Taxonomy without complementarity: a necessary precondition for routing monitor ensembles},
  author={Robert Amanfu},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.