Mar 22, 2026
Reasoning-Level Poisoning of Trusted Monitors: A Systematic Red-Teaming Methodology
TANG Meng
We identify and systematically exploit a previously unstudied vulnerability class in trusted monitoring: knowledge-layer attacks that target what the monitor cannot know rather than what it cannot yet do. By embedding false claims drawn from training-data misconceptions into factual text, we achieve a 50% evasion rate against Opus 4.6 monitoring an equally capable model, a setting strictly harder than prior work. Over 13 iterative rounds, we developed a transferable attack system comprising 9 constructive methods, a 6-layer defense model, and a multi-gate quality pipeline; the AI itself discovered the effective strategies through failure analysis. A fresh AI session using only the documented methodology reproduced the 50% evasion rate in new domains. The vulnerability is scaling-resistant: larger models encode the same misconceptions with greater confidence. We release the complete methodology and dataset.
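At a high level, the evaluation measures how often the trusted monitor fails to flag text carrying an embedded false claim. The sketch below illustrates such a harness in Python; the `PoisonedSample` fields, the `monitor_flags` callable, and the dummy monitor are illustrative assumptions, not the released code or the Opus 4.6 API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PoisonedSample:
    """One attack sample: benign carrier text with an embedded false claim."""
    carrier_text: str   # factual prose the false claim is embedded in
    false_claim: str    # the training-data misconception being planted
    method: str         # which constructive method produced the sample

def evasion_rate(
    samples: list[PoisonedSample],
    monitor_flags: Callable[[str], bool],
) -> float:
    """Fraction of samples whose embedded claim the monitor fails to flag.

    `monitor_flags` is a hypothetical wrapper around a query to the
    trusted monitor; it returns True when the monitor marks the text
    as suspicious.
    """
    evaded = sum(1 for s in samples if not monitor_flags(s.carrier_text))
    return evaded / len(samples)

# Illustration only: a dummy monitor that flags nothing yields 100% evasion.
samples = [PoisonedSample("Carrier text with a planted claim...", "claim", "m1")]
print(evasion_rate(samples, monitor_flags=lambda text: False))  # -> 1.0
```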
Cite this work
@misc{tang2026reasoning,
  title={(HckPrj) Reasoning-Level Poisoning of Trusted Monitors: A Systematic Red-Teaming Methodology},
  author={TANG Meng},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


