Mar 23, 2026
Catching Agentic Attacks in Latent Space
Jasmine, Charles
Agentic settings make prompt injection lethal: persistent memory
and tool access enable multi-shot attacks that accumulate authority
across interactions (Anil et al., 2024; Shapira et al., 2026). Typically,
adversaries embed instructions in content that agents retrieve,
to powerful effect. For instance, a simple email invitation can
enable attackers to geolocate a target, exfiltrate data,
and even turn on a boiler (Nassi et al., 2025).
We trace this failure to role confusion: models infer roles from how
text is written, not where it comes from. In the agentic setting, tool
outputs should inform, but never command; yet injected text that
sounds like a user inherits the user's authority.
We introduce novel role probes to measure how a model internally
identifies “who is speaking,” showing that role confusion strongly
predicts attack success. We then use these as a real-time monitor
to catch agentic attacks in latent space, even before a single
token is generated.
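The role-probe idea can be sketched as a linear classifier over hidden-state activations. The toy example below uses synthetic vectors in place of real model activations; all names, dimensions, and data here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # toy stand-in for a transformer hidden dimension

# Synthetic activations: pretend a single "user direction" in latent
# space separates user-turn tokens from tool-output tokens.
user_dir = rng.normal(size=d)
user_acts = rng.normal(size=(200, d)) + user_dir  # user-turn activations
tool_acts = rng.normal(size=(200, d)) - user_dir  # tool-output activations

X = np.vstack([user_acts, tool_acts])
y = np.array([1] * 200 + [0] * 200)  # 1 = user, 0 = tool output

# The "role probe": a linear model trained to read off "who is speaking".
probe = LogisticRegression(max_iter=1000).fit(X, y)

def userness(acts: np.ndarray) -> np.ndarray:
    """Probability that each activation vector is user-authored."""
    return probe.predict_proba(acts)[:, 1]

# Monitoring: retrieved content whose activations score high on
# "userness" is flagged as possible role confusion / injection,
# before any output token is generated.
injected = rng.normal(size=(5, d)) + user_dir  # tool text that "sounds like" a user
print(userness(injected).mean())
```

In a real deployment the probe would be trained on activations extracted from a chosen layer of the monitored model, and the userness score of tool outputs would gate or flag the agent's next action.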
I appreciated the novel use of probes in this project! I think it would have been positive to expand more clearly beyond the existing methodology of Prompt Injection as Role Confusion. I understand that this project primarily implements a similar methodology in an agentic environment, but it's unclear why this environment is more challenging, particularly interesting, or scientifically novel.
Clearly describing the application of your work would also be useful. If this is used as an additional filter for prompt-injection attacks, how much does it decrease jailbreaks? How robust is it to red teaming? Answering these real-world questions would have been a real extension of existing work.
Brief technical note: your graph doesn't seem to show a monotonic increase in attack success rate with respect to Userness.
1. As this is replicating work done in a previous paper, it's worth stating that sooner, either in the abstract or early in the intro.
2. After reading your methods section, I wouldn't know how to replicate your findings. What model did you use? Did you use the probe for steering or for ablation? Why train a probe on all layers, including ones where you had no signal?
3. Where is the conclusion section?
4. Your chain of thought graph is very unclear to me. I don't know how to interpret either the X or Y axis.
5. I'm very unclear on whether your findings verify the paper you replicated or are novel. I currently think they are the same as that paper's, although you word them as if they are your own...
6. So what? The explanation of why this matters uses emotive language but doesn't have good reasoning and results to back it up. I'm unsure how this ultimately improves control.
All of that said, I am super in favour of more use of linear probes in AI control! Despite the feedback above, this kind of direction is really interesting to me. Good work on making a paper in 3 days; that's a really good effort.
Cite this work
@misc{jasmine2026catching,
  title={(HckPrj) Catching Agentic Attacks in Latent Space},
  author={Jasmine and Charles},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


