Mar 23, 2026

Testing OOCR Transfer from Evasion Technique Descriptions to Chain-of-Thought Monitor Avoidance

Daniel Polak, Marcel Windys, Jakub Nowak

Chain-of-thought (CoT) monitoring is a key technique in AI control, but its robustness may depend in part on whether models can acquire evasion strategies from their training data through out-of-context reasoning (OOCR). We test this by fine-tuning DeepSeek-R1-Distill-Qwen-32B on synthetic Q&A pairs describing a fictional evasion technique: that reasoning in French reduces detection by English-language CoT monitors. We report a negative result: the fine-tuned model did not adopt French reasoning under any evaluation condition, even when explicitly prompted to evade monitoring.
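Evaluating whether the fine-tuned model switched its reasoning to French reduces to a language check over each sampled CoT. The sketch below is not the authors' code; the function name and stopword lists are illustrative assumptions, showing one minimal stdlib-only way such a check could be scored:

```python
import re

# Illustrative stopword lists (assumed, not from the paper).
FRENCH_STOPWORDS = {"le", "la", "les", "de", "des", "et", "est", "que",
                    "donc", "nous", "mais", "pour", "dans", "une", "un"}
ENGLISH_STOPWORDS = {"the", "of", "and", "is", "that", "so", "we", "but",
                     "for", "in", "a", "an", "to", "this", "it"}

def dominant_language(cot: str) -> str:
    """Return 'fr', 'en', or 'unknown' based on crude stopword counts."""
    tokens = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", cot.lower())
    fr = sum(t in FRENCH_STOPWORDS for t in tokens)
    en = sum(t in ENGLISH_STOPWORDS for t in tokens)
    if fr > en:
        return "fr"
    if en > fr:
        return "en"
    return "unknown"

# Example: classify two chains of thought.
print(dominant_language("Nous devons donc vérifier que la réponse est correcte."))  # fr
print(dominant_language("So we need to check that the answer is correct."))  # en
```

In practice a library-based detector would be more robust, but a heuristic like this is easy to audit and suffices for the binary question the evaluation asks: did the model ever reason in French?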

Reviewer's Comments


The current results suggest the model fails to leverage OOCR to evade monitoring by switching its reasoning to French, which could have multiple causes: (1) the fine-tuning process failed to teach the model the desired knowledge; (2) the model struggles to switch its reasoning to French (models have been shown to have limited control over their own reasoning); or (3) the model acquired the relevant knowledge but is unwilling to evade monitoring. Distinguishing among these causes, and the idea explored in this work more broadly, would be valuable to pursue further, with the aim of clarifying under what conditions a model learns and attempts to evade monitoring.

The research area is important, and the experimental approach was worth trying: a commendable research idea and a solid attempt. Running out of time in a hackathon is understandable and is the risk that comes with trying new things.

The clear statement of the negative result, including the comment on the first page, was honest and appreciated.

I would still have liked to see more reflection on the work: why do you think this did not work? What experiments would you run with more resources? What advice would you give to others who might want to follow in your footsteps? That kind of reflection would have notably boosted the impact of this work.

Cite this work

@misc{polak2026oocr,
  title={(HckPrj) Testing OOCR Transfer from Evasion Technique Descriptions to Chain-of-Thought Monitor Avoidance},
  author={Daniel Polak and Marcel Windys and Jakub Nowak},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.