Mar 23, 2026
Trigger Happy: Trusted Monitors Mistake Noisy Failures for Adversarial Intent
Prithviraj Singh Shahani
We set out to test whether LLM-based monitors can detect adversarial intent when the harmful actions themselves fail, and whether that detection holds up once the evidence of failure is removed. We built two LinuxArena environments in which OS permissions make the adversarial side task impossible, ran attack-policy agents against them, and pruned the failure steps from the resulting trajectories. We found that monitors over-rely on the noise of failed actions: removing those steps degrades safety-curve performance and shifts monitor narratives from confident accusations to hedged suspicion.
Cite this work
@misc{shahani2026triggerhappy,
  title={(HckPrj) Trigger Happy: Trusted Monitors Mistake Noisy Failures for Adversarial Intent},
  author={Prithviraj Singh Shahani},
  year={2026},
  month={mar},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
