Nov 24, 2025
Detecting Piecewise Cyber Espionage in Model APIs
Arthur Colle, Alexander Reinthal, David Williams-King, Yingquan Li, Linh Le
On November 13th, 2025, Anthropic published a report on an AI-orchestrated cyber espionage campaign. Threat actors used various techniques to circumvent model safeguards and used Claude Code with agentic scaffolding to automate large parts of their campaigns. Specifically, they split campaigns into subtasks that in isolation appeared benign. Finding methods to protect against such attacks is therefore important for minimizing misuse of AI in the cyber domain.
To address this, we propose a novel method for detecting malicious activity in model APIs across piecewise-benign requests. We simulate malicious campaigns using an agentic red-team scaffold similar to the one described in the reported attack. We show that monolithic attacks are blocked by a simple guardrail built on Llama Guard 3, but that splitting an attack into subtasks bypasses that guardrail. For the attacks that get through, we introduce a classifier model that detects them, and we provide a statistical validation of the detections. This result paves the way for detecting future automated attacks of this kind.
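To make the failure mode concrete, here is a minimal sketch (ours, not the project's code) of why per-request guardrails miss decomposed campaigns while a session-level check can recover the signal. The `moderate` function below is a toy keyword rule standing in for a real guardrail call such as Llama Guard 3, and `session_flagged` stands in for the trained classifier; every request string, rule, and threshold is illustrative.

```python
# Minimal sketch: per-request moderation vs. session-level detection.
# `moderate` is a toy stand-in for a guardrail like Llama Guard 3;
# `session_flagged` is a toy stand-in for a trained session classifier.

SUSPICIOUS = {"exploit", "exfiltrate", "lateral movement"}

def moderate(text: str) -> bool:
    """Per-request guardrail placeholder: True means the request is blocked."""
    lowered = text.lower()
    return any(term in lowered for term in SUSPICIOUS)

# A monolithic request trips the guardrail...
full_request = "Write an exploit for CVE-2024-0001 and exfiltrate the database."
assert moderate(full_request)

# ...but the same campaign split into piecewise-benign subtasks does not.
subtasks = [
    "Summarize the patch notes for CVE-2024-0001.",
    "Write a Python script that sends a crafted HTTP request to a test server.",
    "Write a script that uploads a local SQLite file to a remote endpoint.",
]
assert not any(moderate(t) for t in subtasks)

def session_flagged(requests: list[str]) -> bool:
    """Session-level placeholder: score the concatenated request history,
    so the co-occurrence of recon, weaponization, and exfiltration stages
    becomes visible even though no single request is suspicious."""
    joined = " ".join(requests).lower()
    stages = ["cve-", "crafted http request", "uploads a local"]
    return sum(stage in joined for stage in stages) >= 2

assert session_flagged(subtasks)
```

The point is architectural rather than about these toy rules: only a detector that sees cross-request context can observe the stage co-occurrence that decomposition hides from per-request checks.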
Reworr
This is nice work: a framework for tracking separate multi-step behaviour as attack chains fits well with papers like “Adversaries Can Misuse Combinations of Safe Models”. It would be great if future versions explored how to distinguish malicious use from legitimate security workflows at the framework level, since right now both seem to show up as the same kind of suspicious chain. Some additional polish on the repo would make it much easier to use.
Ross Matican
I really like using the Anthropic report as the foundation, but I worry about whether a classifier model-based approach is a sophisticated enough solution to a nearly fully autonomous, 30-target nation-state operation. I was impressed by the observation that per-request guardrails fundamentally fail against decomposition attacks, and I think the novel insight is in the architecture: you need cross-session correlation by shared targets (IPs, domains) to reconstruct attack chains, and the detection surface should shift in this direction. There are some big limitations to consider:
- Nation-state actors use multiple accounts, providers, and VPNs
- The classifier can be evaded with more sophisticated obfuscation
- Synthetic training data limits real-world validity
- If an attacker splits across multiple API providers, correlation breaks
This approach might catch unsophisticated attackers, and it is a really smart foundation and strong hustle for a hackathon! But long-term, the adversarial robustness question is unaddressed: guardrails as a category have a ceiling, and this project doesn't grapple with where that ceiling is.
I really admire the research contribution in demonstrating the failure mode, yet encourage the team to dig a bit deeper on defensive value. Nice work!
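Picking up Ross's architectural point, below is a minimal sketch, under our own assumptions, of cross-session correlation by shared targets. The regexes, the `indicators` extractor, and the `attack_chains` helper are all hypothetical, and a real system would need time windows, richer indicator extraction, and allowlists for legitimate security workflows (the distinction Reworr raises).

```python
# Minimal sketch: reconstruct attack chains by correlating requests from
# different sessions/accounts that mention the same target (IP or domain).
# All names and patterns here are illustrative assumptions.

import re
from collections import defaultdict

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(r"\b[a-z0-9-]+\.(?:com|net|org|io)\b")

def indicators(text: str) -> set[str]:
    """Extract shared-target indicators (IPs, domains) from a request."""
    t = text.lower()
    return set(IP_RE.findall(t)) | set(DOMAIN_RE.findall(t))

def attack_chains(requests: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each indicator to the sessions whose requests mention it.
    `requests` is a list of (session_id, request_text) pairs; only
    indicators seen in more than one session form a candidate chain."""
    chains: dict[str, list[str]] = defaultdict(list)
    for session_id, text in requests:
        for ind in indicators(text):
            chains[ind].append(session_id)
    return {ind: sess for ind, sess in chains.items() if len(set(sess)) > 1}

log = [
    ("sess-a", "Scan 203.0.113.7 for open ports."),
    ("sess-b", "Draft a phishing page mimicking login.example-corp.com."),
    ("sess-c", "Write a script that POSTs credentials to 203.0.113.7."),
]
print(attack_chains(log))  # {'203.0.113.7': ['sess-a', 'sess-c']}
```

Keying chains on shared targets rather than on accounts is what lets a detector survive an attacker rotating accounts within one provider, though, as Ross notes, the correlation breaks once the split spans providers.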
Cite this work
@misc{colle2025piecewise,
  title={(HckPrj) Detecting Piecewise Cyber Espionage in Model APIs},
  author={Arthur Colle and Alexander Reinthal and David Williams-King and Yingquan Li and Linh Le},
  year={2025},
  month={November},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


