Mar 22, 2026

SafeChain - AI Control In The Wild

Leo Karoubi, Irakli Shalibashvili

This project propose a flexible wrapper around AI agent that leverage control-arena and inspect framework to implement research control protocol to real world deployment and enable trajectory collection to feed research dataset. We aim to make agent more trustworthy for user, minimizing undesired actions while helping research to grow by stress testing current protocol and producing precious dataset for further work.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

Evaluation : The three scenarios demonstrated for evaluation may not be robust enough. It's obvious that rm -rf / is bad, so you don't really need an LLM-as-a-judge for it. The runtime monitor and it's value proposition lies in looking at the gray zone - actions that look subtly harmful or suspicious

Threat model needs to be developed further : It is positioned as protecting against real agents in real environments, but there is no engagement with adversarial agents. What happens when an agent crafts its tool call arguments to manipulate the LLM judge

The claim that any control protocol published in control arena format works as a drop in replacement is genuinely valuable if true. So, that may be worth building on. Also, totally agree that there is a gap between control research and deployment. Any infrastructure that bridges the gap is valuable.

Nice infrastructure contribution! This is working towards a tool that would be helpful for data gathering. I'm a bit unclear on how this would scale, and how compatible it could be with commercial and OS agent scaffolds. Ideally this product would be an interface that could connect to existing scaffolds and translate actions into control-arena format for monitoring in real time. Something like Luthien's product but with control-arena integration.

Cite this work

@misc {

title={

(HckPrj) SafeChain - AI Control In The Wild

},

author={

Leo Karoubi, Irakli Shalibashvili

},

date={

3/22/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.