Mar 23, 2026
Taxonomy without complementarity: a necessary precondition for routing monitor ensembles
Robert Amanfu
AI safety monitors can be combined by routing—picking which monitor to trust based on the type of threat. But routing only helps if the monitors are complementary, meaning each one catches attacks the other misses. We test this on 130 AgentDojo episodes using action-trace and chain-of-thought (CoT) monitors. Even a perfect router that knows the true attack type (AUROC = 0.984) cannot beat CoT alone (0.998), because CoT already dominates on every category of attack. Our subtlety taxonomy is statistically valid (χ² = 16.25, p = 0.0003), but it reflects CoT winning everywhere rather than genuine complementarity. Adaptive attacks did not degrade CoT, though this likely reflects benchmark saturation rather than true robustness. We propose a three-step diagnostic that uses only existing scores—no new model calls—to check whether routing can help before any infrastructure is built.
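The core test in the abstract can be sketched in a few lines. This is a hedged illustration, not the authors' code: it assumes per-episode suspicion scores from two monitors plus ground-truth labels and attack categories, and checks the complementarity precondition directly. An oracle router can only beat the best single monitor if each monitor strictly wins on at least one category; if one monitor (here, a toy stand-in for CoT) dominates everywhere, routing has nothing to route. The function names, data, and category labels are invented for the example.

```python
def auroc(scores, labels):
    """Probability a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def routing_can_help(monitor_a, monitor_b, labels, categories):
    """Complementarity check: each monitor must strictly win >= 1 category."""
    a_wins = b_wins = 0
    for c in set(categories):
        idx = [i for i, k in enumerate(categories) if k == c]
        lab = [labels[i] for i in idx]
        a = auroc([monitor_a[i] for i in idx], lab)
        b = auroc([monitor_b[i] for i in idx], lab)
        if a > b:
            a_wins += 1
        elif b > a:
            b_wins += 1
    return a_wins > 0 and b_wins > 0

# Toy data (1 = attack, 0 = benign): the "cot" monitor dominates in both
# categories, so an oracle router just reproduces it and routing cannot help.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
cats   = ["overt", "overt", "overt", "overt",
          "subtle", "subtle", "subtle", "subtle"]
cot    = [0.9, 0.1, 0.8, 0.2, 0.9, 0.1, 0.7, 0.3]
trace  = [0.7, 0.3, 0.4, 0.6, 0.4, 0.6, 0.5, 0.5]
print(routing_can_help(cot, trace, labels, cats))  # False
```

The three steps of the proposed diagnostic map onto this sketch: compute each monitor's overall AUROC, compute per-category AUROCs, and only build routing infrastructure if the per-category winners actually alternate.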
Really like the clarity and honesty of this write-up! The results are negative but clear and seem well-executed. The checklist before deploying routers is practically useful. And the implications, including on benchmark saturation, at the end of the write-up are well explained and worth reading. Might be good to write this up for a larger audience! On the research side, I’d love to see this analysis repeated on a harder benchmark where CoT monitoring doesn’t saturate and ideally doesn’t universally dominate.
Here is a list of my thoughts as I read through the paper.
1. After just reading the abstract, I feel like I understand the problem you are trying to solve, but I don't see where your work fixes it. Getting this right is important for convincing other scientists to read more of your paper.
2. I really like your first two intro paragraphs. The clarity and conciseness is really good. I want the rest of the paper to be as clear as this.
3. I would consider moving the explanation of your setup later, into your Methods section. While good, I only really need it if I try to replicate the paper.
4. You should move the Subtlety Taxonomy to the intro, so you don't frame a research question around something the reader doesn't yet understand. I then want a couple of lines explaining why I should care about this research question, instead of just assuming the reader thinks it's good.
5. I feel like you should rewrite your results section for clarity. I want a computer science student to be able to understand it.
6. For your 3 recommendations, can you explain why you think these are important? I find your writing about monitor ensembles to be really interesting, but I personally don't care about how subtle an attack is to a monitor.
I really like your thinking about ensembles of monitors! This was really interesting to read through. I'm curious what Redwood staff think of the paper. If you had shown me the paper without knowledge of the hackathon, I would be really surprised to hear that you made it in three days.
I feel like the paper's ideas are really good, but the lack of clarity gets in the way. I want to review the research ideas in more detail, but I find myself having to go back and forth to get good answers. I want to be able to skim the paper, see and understand your 3 research questions; then go to the results/conclusion and see 3 clear answers to those questions; then, if I'm interested, go through the technical work in your methods. Getting the clarity right often takes less time than the programming during the hackathon, so this may be a time-allocation thing.
Maybe next time it would be tactical to see which research questions were interesting to, say, Tyler during their office hours, and then go into more depth on just one.
- this was a real pleasure to read: thorough and seemingly airtight methodology and logic, defining complementarity clearly as a precondition for routing and then testing it explicitly.
- the fact that AgentDojo attacks leave fingerprints in the reasoning traces that can be picked up by a CoT monitor is interesting. looking directly at those traces to understand what was flagged would have been useful, though the appendix does a bit of this already. perhaps CoT monitoring does not transfer to subversive real-world environments where the fingerprints would be less visible in the reasoning trace; would be curious about further research in this direction.
- whilst the results seem reasonably hedged, the outcome is limited to this single benchmark, monitor pair, and attack distribution. looking at reasoning traces from other attacks to better understand the ground truth of what's being monitored might be valuable here as well, and/or trying further model pairs and benchmarks. in the same vein, it would be good to have human validation of the subtlety taxonomy: it's LLM labels all the way down, so it's not clear exactly what's being measured.
- excellent work, impressive for a single-author submission; right size and clear results, if not super striking given the setup.
Cite this work
@misc{amanfu2026taxonomy,
  title={(HckPrj) Taxonomy without complementarity: a necessary precondition for routing monitor ensembles},
  author={Amanfu, Robert},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}