Mar 23, 2026
Catching Agentic Attacks in Latent Space
Jasmine, Charles
Agentic settings make prompt injection lethal: persistent memory
and tool access enable multi-shot attacks that accumulate authority
across interactions (Anil et al., 2024; Shapira et al., 2026). Typically,
adversaries embed instructions in content that agents retrieve,
to powerful effect. For instance, a simple email invitation can
enable attackers to geolocate a target, exfiltrate data,
and even turn on a boiler (Nassi et al., 2025).
We trace this failure to role confusion: models infer roles from how
text is written, not where it comes from. In the agentic setting, tool
outputs should inform, but never command; yet injected text that
sounds like a user inherits the user's authority.
We introduce novel role probes to measure how a model internally
identifies “who is speaking,” showing that role confusion strongly
predicts attack success. We then use these as a real-time monitor
to catch agentic attacks in latent space, even before a single
token is generated.
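The role-probe idea can be sketched as a linear classifier over hidden-state activations. The toy example below uses synthetic vectors in place of real model activations; all names, dimensions, and data here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # toy stand-in for a transformer hidden dimension

# Synthetic activations: pretend a single "user direction" in latent
# space separates user-turn tokens from tool-output tokens.
user_dir = rng.normal(size=d)
user_acts = rng.normal(size=(200, d)) + user_dir  # user-turn activations
tool_acts = rng.normal(size=(200, d)) - user_dir  # tool-output activations

X = np.vstack([user_acts, tool_acts])
y = np.array([1] * 200 + [0] * 200)  # 1 = user, 0 = tool output

# The "role probe": a linear model trained to read off "who is speaking".
probe = LogisticRegression(max_iter=1000).fit(X, y)

def userness(acts: np.ndarray) -> np.ndarray:
    """Probability that each activation vector is user-authored."""
    return probe.predict_proba(acts)[:, 1]

# Monitoring: retrieved content whose activations score high on
# "userness" is flagged as possible role confusion / injection,
# before any output token is generated.
injected = rng.normal(size=(5, d)) + user_dir  # tool text that "sounds like" a user
print(userness(injected).mean())
```

In a real deployment the probe would be trained on activations extracted from a chosen layer of the monitored model, and the userness score of tool outputs would gate or flag the agent's next action.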
I appreciated the novel use of probes in this project! I think it would have been positive to expand more clearly beyond the existing methodology of Prompt Injection as Role Confusion. I understand that this project primarily implements a similar methodology in an agentic environment, but it's unclear why this environment is more challenging, particularly interesting, or scientifically novel.
Clearly describing the application of your work would also be useful. If this is used as an additional filter for prompt-injection attacks, how much does it decrease jailbreaks? How robust is it to red teaming? Answering these real-world questions would have been a real extension of existing work.
Brief technical note: your graph doesn't seem to show a monotonic increase in attack success rate with respect to Userness.
1. As this is replicating work done in a previous paper, it's worth stating that sooner, either in the abstract or early in the intro.
2. After reading your methods section, I wouldn't know how to replicate your findings. What model did you use? Did you use the probe for steering or for ablation? Why train a probe on all layers, including ones where you had no signal?
3. Where is the conclusion section?
4. Your chain of thought graph is very unclear to me. I don't know how to interpret either the X or Y axis.
5. I'm very unclear on whether your findings verify the paper you replicated or are novel. I currently think they are the same as that paper's, although you word them as if they are your own...
6. So what? The explanation of why this matters uses emotive language but doesn't have good reasoning and results to back it up. I'm unsure how this ultimately improves control.
All of that said, I am super in favour of more use of linear probes in AI control! Despite the feedback above, this kind of direction is really interesting to me. Good work on making a paper in 3 days; that's a really good effort.
Cite this work
@misc{jasmine2026catching,
  title={(HckPrj) Catching Agentic Attacks in Latent Space},
  author={Jasmine and Charles},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


