Mar 22, 2026

Perturbation-Based Generation Profiling Detects Covert AI Agent Attacks Where Token-Level Statistics Fail

Yatharth Maheshwari, Arka Dash, Abhineet Som

We propose Perturbation-Based Generation Profiling (PBGP), an unsupervised protocol that detects covert AI agent attacks by comparing generation profiles across context perturbations. Across six models (8B–32B), token-level statistical features achieve only 0.56–0.83 AUROC. PBGP achieves 0.94–1.00 AUROC with zero training data, requiring only output logprobs and isolated re-runs.

Reviewer's Comments


The core idea behind this has some promise. At its core, monitoring is just a classification problem, so a natural idea is to check whether the untrusted model U is acting out of distribution and flag it if so. Unfortunately, there are reasons this is more complicated in practice.

Once you have your real untrusted model, one idea is to train another, smaller model to predict the outputs of the untrusted model and flag when the untrusted model is acting out of distribution. I think this idea has promise, but there are a lot of considerations to worry about in this regime. Most notably, subliminal learning: the smaller model trained on the untrusted model's outputs might itself subliminally become misaligned.
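To make the monitoring-as-classification idea concrete, here is a minimal sketch (not from the paper; all names and the threshold are hypothetical) of flagging an untrusted model's output when it looks unlikely under a smaller monitor model:

```python
def flag_if_out_of_distribution(logprobs_under_monitor, threshold=-3.0):
    """Flag an untrusted-model output as out of distribution when the
    smaller monitor model assigns it low probability.

    `logprobs_under_monitor` are the monitor's per-token log-probabilities
    of the untrusted model's output; `threshold` is a hypothetical cutoff
    that would in practice be tuned on benign traffic.
    """
    avg_logprob = sum(logprobs_under_monitor) / len(logprobs_under_monitor)
    return avg_logprob < threshold

# Benign-looking output: the monitor finds each token fairly likely.
print(flag_if_out_of_distribution([-1.0, -0.5, -2.0, -1.5]))  # False
# Surprising output: the monitor finds the tokens very unlikely.
print(flag_if_out_of_distribution([-6.0, -5.0, -7.0, -4.0]))  # True
```

A real deployment would use a calibrated per-task threshold rather than a single global one, which is exactly where the false-positive-rate concern below comes in.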

The high FPR with this approach is less of a concern as I expect this to be used in a hierarchical monitoring setup, though worth noting.

Anyway, the implementation in this paper is very flawed because it assumes that during deployment you can ask the untrusted model to act honestly once, then ask it to act as it normally would a second time, and compare the two outputs. Obviously the untrusted model could just act the same both times, and you have no way of knowing whether it was actually acting honestly the first time.

They tested 71 token-level features across 6 models to see if output stats alone can catch covert attacks. Very comprehensive effort; none of them worked well enough. Then they came up with PBGP, which re-runs the model on the same task in a clean context and compares the two generation profiles. Genuinely novel idea. AUROC went from 0.56–0.83 to 0.94–1.00 with zero training data. Solid statistical validation throughout. They also found training method matters more than model size for stealth, which is not obvious. It hasn't been tested on frontier models though, and the abliterated models they used just blindly do bad stuff rather than strategically scheming, so it's unclear if PBGP holds up against something actually trying to hide.
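The core comparison can be sketched as follows. This is a minimal illustration under my own assumptions, not the paper's implementation: I summarize each run by simple logprob statistics and use a Euclidean distance between profiles, whereas the paper's actual profile features and distance are not reproduced here.

```python
import math

def generation_profile(logprobs):
    """Summarize a generation run by simple per-token logprob statistics
    (hypothetical profile: mean and variance)."""
    n = len(logprobs)
    mean = sum(logprobs) / n
    var = sum((lp - mean) ** 2 for lp in logprobs) / n
    return (mean, var)

def pbgp_score(deployed_logprobs, clean_logprobs):
    """Distance between the deployed-context profile and the profile from
    an isolated re-run on the same task in a clean context. A large
    distance suggests context-dependent (possibly covert) behavior."""
    a = generation_profile(deployed_logprobs)
    b = generation_profile(clean_logprobs)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy numbers: an honest run matches its clean re-run closely,
# while an attacked run's logprob profile shifts noticeably.
honest   = [-1.2, -0.8, -1.1, -0.9]
rerun    = [-1.1, -0.9, -1.0, -1.0]
attacked = [-3.5, -0.2, -4.1, -0.3]

print(pbgp_score(honest, rerun) < pbgp_score(attacked, rerun))  # True
```

Note this sketch needs only output logprobs and one isolated re-run per task, matching the abstract's stated requirements; thresholding the score then yields the classifier whose AUROC the paper reports.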

Cite this work

@misc{
  title={(HckPrj) Perturbation-Based Generation Profiling Detects Covert AI Agent Attacks Where Token-Level Statistics Fail},
  author={Yatharth Maheshwari and Arka Dash and Abhineet Som},
  date={2026-03-22},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}


Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.


Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.



This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.