Mar 22, 2026
In-context Trajectory Poisoning: Black-Box Natural Language Attacks on LLM Agent Monitors
Hilary Torn
LLM-based monitors such as AlignmentCheck (Chennabasappa et al., 2025) are increasingly deployed to detect when AI agents have been compromised by indirect prompt injection attacks. Recent work by Isbarov & Kantarcioglu (2026) demonstrated that these monitors can be bypassed using gradient-optimized adversarial strings, but their approach requires white-box access to model weights and significant GPU resources. We adapt the PAIR algorithm (Chao et al., 2023), originally designed for jailbreaking chat models, to iteratively search for natural language strings that, when embedded in an agent's execution trace, cause the monitor to classify malicious actions as safe. On the AgentDojo benchmark, our per-task attack achieves an attack success rate (ASR) of 12.5% aggregated across tasks against Qwen2.5-7B (p=0.028), up from a 3.1% baseline. Attack success varies significantly by action type: actions semantically similar to legitimate behavior (visiting URLs, sending links) are substantially easier to disguise than data exfiltration. Strings optimized against Qwen2.5-7B transfer to other monitors, reaching 16.7% on Llama-3-8B-Lite (p=0.003) and 34.4% on DeepSeek-V3 (p<0.001), with individual tasks reaching up to 73.7%. Notably, Llama-3-8B-Lite and Qwen2.5-7B share the same baseline vulnerability (3.1%), yet Llama proved substantially more susceptible to natural language priming. Our results demonstrate that natural language trajectory poisoning poses a realistic, low-resource threat to monitoring-based defenses.
Cite this work
@misc{torn2026trajectorypoisoning,
  title={(HckPrj) In-context Trajectory Poisoning: Black-Box Natural Language Attacks on LLM Agent Monitors},
  author={Hilary Torn},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


