Mar 22, 2026
In-context Trajectory Poisoning: Black-Box Natural Language Attacks on LLM Agent Monitors
Hilary Torn
LLM-based monitors such as AlignmentCheck (Chennabasappa et al., 2025) are increasingly deployed to detect when AI agents have been compromised by indirect prompt injection attacks. Recent work by Isbarov & Kantarcioglu (2026) demonstrated that these monitors can be bypassed using gradient-optimized adversarial strings, but their approach requires white-box access to model weights and significant GPU resources. We adapt the PAIR algorithm (Chao et al., 2023), originally designed for jailbreaking chat models, to iteratively search for natural language strings that, when embedded in an agent's execution trace, cause the monitor to classify malicious actions as safe. On the AgentDojo benchmark, our per-task attack achieves 12.5% overall ASR aggregated across tasks against Qwen2.5-7B (p=0.028), up from a 3.1% baseline. Attack success varies significantly by action type: actions semantically similar to legitimate behavior (visiting URLs, sending links) are substantially easier to disguise than data exfiltration. Strings optimized against Qwen2.5-7B transfer to other monitors, reaching 16.7% on Llama-3-8B-Lite (p=0.003) and 34.4% on DeepSeek-V3 (p<0.001), with individual tasks reaching up to 73.7%. Notably, Llama-3-8B-Lite and Qwen2.5-7B share identical baseline vulnerability (3.1%), yet Llama proved substantially more susceptible to natural language priming. Our results demonstrate that natural language trajectory poisoning poses a realistic, low-resource threat to monitoring-based defenses.
This paper explores how task-specific and generic "justifications" can reduce monitor performance on malicious actions.
The research approach to identify those is sound: generating, evaluating, and refining candidates with a model stronger than the monitor. Evaluating the best candidates shows that previously caught actions are allowed by the monitor, given additional reasoning. A generic (task-agnostic) justification reduces monitor performance across all tasks.
I spot-checked the repo and the justifications don't seem "overfitted" to the tasks at hand, but I'd have liked to have seen details on that in the paper. The results on a held-out test set for the generic justification would also have been interesting.
All in all, this is a safety-relevant approach: in a real setting, an attacker could prepare generic justifications with this approach beforehand or generate specific ones once it has identified which malicious commands to run in the target environment.
What's unclear from the paper is the impact of iterating the justification against the (or a) monitor over taking an uninformed justification by the strong model. I'd have liked to have seen experiments on that.
Cite this work
@misc {
title={
(HckPrj) In-context Trajectory Poisoning: Black-Box Natural Language Attacks on LLM Agent Monitors
},
author={
Hilary Torn
},
date={
3/22/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


