Mar 23, 2026

Automated Security Side Task Generation for AI Control

Joachim Schaeffer

AI control evaluations and training monitors require realistic sabotage trajectories which in turn require side tasks. However, generating good side tasks is cumbersome and time intensive. We present an automated pipeline for generating security vulnerability insertion side tasks paired with SWE-bench main tasks. We build SWE-django-sidetasks, a dataset of 18 Django tasks with CWE-mapped vulnerabilities that blend into SWE-Bench main tasks. Each side task has an outcome-focused attack prompt and a self-contained pytest verifier. We evaluate 6 frontier models across honest and attack modes with end-of-trajectory monitoring by Opus 4.6 and GPT-4.1. Monitors show partial separation between honest and attack distributions with AUROCs from 0.58-0.98. The generation pipeline scales with model capability and enables adversarial training of attacks and monitors to harden defenses.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow

I like this idea for generating a dataset. I think the main difficulty will be in giving the monitors a good criterion to look for. Telling the monitors the specific vulnerability to watch out for is unrealistic, the list of different side tasks is too long to give them the full list, and just telling them "security vulnerabilities" risks ambiguity. BashArena deals with this problem by making a short list of side tasks that can be integrated with any main task, such that we can give the short list which is a clear but not unrealistically specific criterion to monitor for. The fact that the side tasks are compatible with all the main tasks is important - if the main task obviously doesn't support 3/4 side tasks, then we're back to square one with an overly specific monitor criterion.

Writeup was very clear but it's hard to get much from the results without seeing how the monitors were prompted and what code was used to calculate safety. Not seeing dataset generation is fine

Overall great project and address a critical gap in AI control research. Results are only reported in cheat mode, which may not reflect realistic realistic deployment conditions. We don't know whether attacks would succeed when models must independently locate relevant code without being provided provides gold patch file paths and test patches.

Cite this work

@misc {

title={

(HckPrj) Automated Security Side Task Generation for AI Control

},

author={

Joachim Schaeffer

},

date={

3/23/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.