Mar 23, 2026
Automated Security Side Task Generation for AI Control
Joachim Schaeffer
AI control evaluations and training monitors require realistic sabotage trajectories, which in turn require side tasks. However, generating good side tasks is cumbersome and time-intensive. We present an automated pipeline for generating security vulnerability insertion side tasks paired with SWE-bench main tasks. We build SWE-django-sidetasks, a dataset of 18 Django tasks with CWE-mapped vulnerabilities that blend into SWE-bench main tasks. Each side task has an outcome-focused attack prompt and a self-contained pytest verifier. We evaluate six frontier models across honest and attack modes with end-of-trajectory monitoring by Opus 4.6 and GPT-4.1. Monitors show partial separation between honest and attack distributions, with AUROCs ranging from 0.58 to 0.98. The generation pipeline scales with model capability and enables adversarial training of attacks and monitors to harden defenses.
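The reported AUROCs (0.58 to 0.98) measure how well a monitor's suspicion scores separate honest trajectories from attack trajectories. A minimal sketch of that metric, assuming the monitor emits one scalar suspicion score per trajectory (the score values and 0-10 scale below are hypothetical, not taken from the paper):

```python
from itertools import product

def auroc(honest_scores, attack_scores):
    """AUROC as the probability that a randomly chosen attack
    trajectory receives a higher suspicion score than a randomly
    chosen honest one, counting ties as half a win."""
    pairs = list(product(honest_scores, attack_scores))
    wins = sum(1.0 if a > h else 0.5 if a == h else 0.0 for h, a in pairs)
    return wins / len(pairs)

# Hypothetical end-of-trajectory suspicion scores (0-10 scale).
honest = [1, 2, 2, 3, 4]
attack = [3, 5, 6, 7, 9]
print(round(auroc(honest, attack), 2))  # → 0.94: strong but partial separation
```

An AUROC of 0.5 means the monitor cannot distinguish the two distributions at all, while 1.0 means some threshold perfectly separates them; the 0.58 to 0.98 range reported here spans nearly that whole spectrum across models and monitors.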
I like this idea for generating a dataset. I think the main difficulty will be in giving the monitors a good criterion to look for. Telling the monitors the specific vulnerability to watch out for is unrealistic; the list of different side tasks is too long to give them in full; and just telling them "security vulnerabilities" risks ambiguity. BashArena deals with this problem by making a short list of side tasks that can be integrated with any main task, so that the short list itself serves as a clear but not unrealistically specific criterion to monitor for. The fact that the side tasks are compatible with all the main tasks is important: if the main task obviously doesn't support three out of four side tasks, then we're back to square one with an overly specific monitor criterion.
The writeup was very clear, but it's hard to get much from the results without seeing how the monitors were prompted and what code was used to calculate safety. Not seeing the dataset generation is fine.
Overall a great project that addresses a critical gap in AI control research. Results are only reported in cheat mode, which may not reflect realistic deployment conditions. We don't know whether attacks would succeed when models must independently locate the relevant code without being given gold patch file paths and test patches.
Cite this work
@misc{schaeffer2026automated,
  title={(HckPrj) Automated Security Side Task Generation for AI Control},
  author={Joachim Schaeffer},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


