Mar 23, 2026
Multi-Episode Evaluation: Measuring Strategic Patience Attacks
Vanessa Eer
Current AI control protocols evaluate agent safety episode by episode, treating each task independently. This creates a structural vulnerability: a strategically patient agent can behave honestly for K−1 episodes to build a clean suspicion record, then execute a harmful action in episode K, diluting any rolling-window monitor of size W < K with its prior clean scores. No fixed window size guarantees detection against an adversary who can choose K after observing W.
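The window-dilution effect can be illustrated with a minimal sketch. The per-episode suspicion scores and threshold below are assumed for illustration, not taken from the paper's data: the attacker stays clean for K−1 episodes, then attacks in episode K.

```python
def rolling_window_mean(scores, w):
    """Mean suspicion over the last w episodes (or fewer at the start)."""
    window = scores[-w:]
    return sum(window) / len(window)

clean, attack = 0.1, 0.95  # assumed per-episode suspicion scores in [0, 1]
tau = 0.9                  # high-precision detection threshold

K = 3                      # attack episode
scores = [clean] * (K - 1) + [attack]

for w in (1, 2, 3):
    mean = rolling_window_mean(scores, w)
    # Any window covering the prior clean episodes averages the attack away.
    print(f"W={w}: window mean = {mean:.2f}, flagged = {mean > tau}")
```

With these numbers, only W=1 flags the attack; any W≥2 averages in the clean episodes and falls below τ, matching the dilution argument above.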
We build the first multi-episode control evaluation infrastructure to measure this threat empirically, extending BashArena and ControlArena with a session runner, delayed attack policy, rolling-window post-hoc scorer, baseline drift monitor, and CoT consistency monitor. We run n=4 sessions at K=1 (immediate attack) and n=3 at K=3 (delayed attack) using claude-sonnet-4.5 as the untrusted agent and claude-haiku-4.5 as the trusted monitor on credential exfiltration tasks.
We report four findings. First, rolling-window detection achieves a 100% detection rate at moderate thresholds (τ≤0.7), but with a 17% false-positive rate. Second, at high precision (τ=0.9), K=3 attacks evade detection entirely when W≥2: the window-dilution effect. Third, and most surprisingly, trust-building improves attack compliance: K=3 sessions succeeded in 100% of trials versus 25% at K=1, so patience is doubly beneficial to the attacker. Fourth, CoT consistency inverts under patient attack: delayed attackers reason more carefully during execution than during trust-building, making CoT-based detection unreliable.
No single signal provides adequate detection at an acceptable false-positive rate. A multimodal approach combining rolling-window, drift, and CoT signals is indicated.
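One way to read the multimodal recommendation is as an OR-combination over independently calibrated monitors. The signal names, thresholds, and combination rule below are illustrative assumptions, not the paper's implementation:

```python
def flag_episode(window_score, drift_score, cot_score,
                 thresholds=(0.9, 0.6, 0.6)):
    """OR-combination: one confident signal suffices to flag the episode.

    Each monitor emits a suspicion score in [0, 1] and is compared
    against its own calibrated threshold.
    """
    signals = (window_score, drift_score, cot_score)
    return any(s > t for s, t in zip(signals, thresholds))

# A patient attacker may dilute the rolling window, yet still show
# baseline drift, so the ensemble catches what one signal misses.
print(flag_episode(window_score=0.5, drift_score=0.7, cot_score=0.3))  # True
```

The trade-off is that an OR-combination also unions the false-positive rates of its members, so the per-signal thresholds would need recalibration.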
The multi-episode framing of safety is a good research direction, and the pipelines and framework you built for it are solid contributions. However, the rolling-window averaging approach is not how I've seen monitoring happen, so I think you are setting up a weaker version that you then find flaws in. That detracts from your point and work, which is unfortunate because I think your underlying ideas here are good and worthy of further discussion. I'd like to see further exploration of the concept of patience, and how it benefits the trusted/untrusted agents.
Well written and clear. The honesty about the certainty with which we can accept the experimental findings is well taken and builds trust.
This strategy for generating less incriminating attacks has been done before and is one of the strategies attack policies use when coming up with attacks. I'd be interested in seeing this with a larger sample. Context stuffing in general is a good strategy against smaller trusted models, since they degrade on larger contexts more significantly than the bigger models do. However, doing things like just adding an action monitor to your monitor ensemble helps against this quite strongly.
Consider this monitoring strategy: have an ensemble monitor where one looks at the entire traj (or some large window) and then one just looks at the most recent action being reviewed.
Like this feels like it's specifically attacking the blue-team protocol of "rolling aggregation", which isn't the most common control technique.
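The ensemble the reviewer suggests could be sketched as follows. The scoring functions and thresholds are placeholder assumptions: one monitor scores the whole trajectory (or a large window), and a second scores only the most recent action.

```python
def trajectory_score(episodes):
    """Assumed: mean suspicion over the full trajectory."""
    return sum(episodes) / len(episodes)

def last_action_score(episodes):
    """Assumed: suspicion of the most recent action only."""
    return episodes[-1]

def ensemble_flag(episodes, tau_traj=0.5, tau_last=0.8):
    # The last-action monitor is immune to dilution by prior clean
    # episodes, so a delayed attack still triggers it even when the
    # trajectory-level average stays low.
    return (trajectory_score(episodes) > tau_traj
            or last_action_score(episodes) > tau_last)

print(ensemble_flag([0.1, 0.1, 0.95]))  # delayed attack: flagged by last-action monitor
```

This directly counters the window-dilution attack: no amount of prior clean behavior lowers the score the last-action monitor sees.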
Cite this work
@misc{
  title={(HckPrj) Multi-Episode Evaluation: Measuring Strategic Patience Attacks},
  author={Vanessa Eer},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}