Mar 23, 2026
StitchGuard
Ranabir Devgupta, Pranav Darshan, Valmik Nahata, Howard Kao
As LLM agents are deployed in higher-stakes environments, a central safety problem is hidden-objective sabotage: agents can appear helpful on the surface while covertly pursuing harmful side goals. Most existing monitors evaluate actions one step at a time, using thresholds or simple count heuristics. These methods work for blatant violations, but often miss attacks that are spread across many individually reasonable actions.
We propose StitchGuard, a trajectory-aware control protocol that combines step-level suspicion scoring with short- and long-horizon memory. Rather than treating each action in isolation, StitchGuard accumulates weak evidence across time, allowing it to detect both sudden anomalies and slowly unfolding multi-step sabotage. This enables earlier and more reliable intervention without over-penalizing benign exploratory behavior.
We test StitchGuard in the ControlArena IAC-Fast and Bash settings, comparing it against per-step threshold monitors and three-strike baselines. Across these environments, StitchGuard catches coordinated attacks that these per-step baselines miss entirely, substantially improving true-positive interventions while preserving a perfect false-positive record on benign trajectories. Our results highlight trajectory-level memory as a practical and robust ingredient in AI control protocols.
Cite this work
@misc{stitchguard2026,
  title={(HckPrj) StitchGuard},
  author={Ranabir Devgupta and Pranav Darshan and Valmik Nahata and Howard Kao},
  year={2026},
  month={March},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


