Mar 23, 2026
Catching Agentic Attacks in Latent Space
Jasmine, Charles
Agentic settings make prompt injection lethal: persistent memory
and tool access enable multi-shot attacks that accumulate authority
across interactions (Anil et al., 2024; Shapira et al., 2026). Typically,
adversaries embed instructions in content that agents retrieve, to
powerful effect. For instance, a simple email invitation can let an
attacker geolocate a target, exfiltrate data, and even switch on a
boiler (Nassi et al., 2025).
We trace this failure to role confusion: models infer roles from how
text is written, not where it comes from. In the agentic setting, tool
outputs should inform but never command; yet injected text that
sounds like a user inherits the user's authority.
We introduce novel role probes to measure how a model internally
identifies "who is speaking," showing that role confusion strongly
predicts attack success. We then use these probes as a real-time
monitor that catches agentic attacks in latent space before a single
token is generated.
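The abstract does not specify the probe architecture, so the following is a minimal illustrative sketch of the general idea: train a linear classifier on hidden-state activations to predict the speaker role, then flag tool-output tokens whose activations score as "user." The activations here are synthetic stand-ins; the variable names, layer choice, and threshold are assumptions, not the authors' method.

```python
# Illustrative sketch of a linear "role probe" on latent activations.
# Real usage would extract per-token hidden states from a model layer;
# here we fabricate activations separated along a synthetic role direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64  # hidden size (assumed for the sketch)

role_dir = rng.normal(size=d_model)
user_acts = rng.normal(size=(200, d_model)) + role_dir  # user-turn tokens
tool_acts = rng.normal(size=(200, d_model)) - role_dir  # tool-output tokens

X = np.vstack([user_acts, tool_acts])
y = np.array([1] * 200 + [0] * 200)  # 1 = user, 0 = tool

probe = LogisticRegression(max_iter=1000).fit(X, y)

# Monitoring: tool-output activations the probe scores as "user"
# are a sign of role confusion, i.e. injected text claiming authority.
suspicious = tool_acts[probe.predict_proba(tool_acts)[:, 1] > 0.9]
print(f"probe accuracy: {probe.score(X, y):.2f}")
print(f"flagged tool tokens: {len(suspicious)}")
```

Because the probe reads activations directly, such a monitor can run during the forward pass, consistent with the claim of catching attacks before any token is generated.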
Cite this work
@misc{catching-agentic-attacks-2026,
  title={(HckPrj) Catching Agentic Attacks in Latent Space},
  author={Jasmine and Charles},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


