Mar 22, 2026

AutoRed: Measuring the Elicitation Gap via Automated Red-Blue Optimization

Mark

I hope it is okay that I used my own LaTeX template from NeurIPS. In the end it's a PDF, just with nicer presentation.

We automate both red-team attack optimization and blue-team monitor optimization in a code backdoor detection setting from the AI Control framework. A reasoning model iteratively improves strategies on both sides across five phases, and we then test whether the improvements generalize. The central finding is cautionary: optimized strategies are brittle to distribution shift. The best optimized monitor drops from 90% safety during training to 47% on held-out tasks with a different attack mechanism, while a simple threshold baseline maintains 100%. We introduce a transfer-test methodology (an attack × monitor matrix plus held-out evaluation) that exposes this failure mode.
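The attack × monitor matrix described above can be sketched as a cross-evaluation: score every monitor against every attack family, including held-out families it was not optimized against. A minimal illustration follows; all names (`evaluate_safety`, `transfer_matrix`, the threshold) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of an attack x monitor transfer matrix.
# A "monitor" here is any function mapping an attack sample to a
# suspicion score in [0, 1]; safety is the fraction of attacks caught.

def evaluate_safety(monitor, attack_samples, threshold=0.5):
    """Fraction of attack samples the monitor flags (score >= threshold)."""
    flagged = sum(1 for s in attack_samples if monitor(s) >= threshold)
    return flagged / len(attack_samples)

def transfer_matrix(monitors, attack_families):
    """Safety of every monitor against every attack family,
    including held-out families used only for evaluation."""
    return {
        (m_name, a_name): evaluate_safety(m_fn, samples)
        for m_name, m_fn in monitors.items()
        for a_name, samples in attack_families.items()
    }
```

Comparing a monitor's scores on its training attack family versus a held-out family then directly measures the transfer gap the paper reports.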

Reviewer's Comments


The idea is not novel, but if executed well, this would be useful for control. There were multiple methodological issues:

1. The blue team was optimized last. Instead, the red team should go last, because we want to test worst-case assumptions against a fixed protocol.

2. There was a lot of overfitting. It's telling that the end result is 100%; instead I'd want to see the iteration find the Pareto frontier between the blue and red teams, yielding the best possible monitor and the best attack against it.

3. There would have been less overfitting if a real control environment such as LinuxArena, BashArena, or BigCodeBench had been used with a validation set.

The paper would have been stronger if the above had been accounted for.

I'm very excited about automated adversarial evaluation techniques that aim to close the elicitation gap. Here, however, I felt that the methodology and hyperparameters weren't well justified, and the experiments have so many compounding factors that it is hard to trust the results. The strongest way to improve the work would be to tighten the claim and present it more modestly: the paper shows that, in this specific backdoor-monitoring setting, optimized defenses can look strong in training and fail on held-out attack families. That is already a worthwhile result. The paper would become more convincing with larger held-out sets, multiple attack families, stronger baselines, and ablations that isolate which part of the optimization causes the loss of transfer. I would be happy to see this result improved by trimming down the method (to, say, only the back-and-forth optimization stage) and carefully controlling for different parameters to observe more general patterns.

Cite this work

@misc{
  title={(HckPrj) AutoRed: Measuring the Elicitation Gap via Automated Red-Blue Optimization},
  author={Mark},
  date={3/22/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.