Mar 23, 2026
Fine-tuning a monitor for AI control
Brandon Brozorio, Bart Jaworski, Jan Ebbing
We wanted to investigate the beneficial effect fine-tuning can have on a monitoring model. We evaluated Qwen3.5-9B as our trusted model and Qwen3.5-27B as our untrusted model. We generated Supervised Fine-Tuning (SFT) training data of the form (action_transcript, malicious_label) using our trusted model on some sample tasks on LinuxArena (as well as comparing to a baseline of public trajectories on the LinuxArena website), then fine tune a Low-Rank Adaptation (LoRA) adapter on our trusted model to reject more malicious actions. Unfortunately technical blocks prevented us from running evaluations on the fine-tuned model.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Fine-tuning a monitor for AI control
},
author={
Brandon Brozorio, Bart Jaworski, Jan Ebbing
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


