Mar 23, 2026
Fine-tuning a monitor for AI control
Brandon Brozorio, Bart Jaworski, Jan Ebbing
We wanted to investigate the beneficial effect fine-tuning can have on a monitoring model. We evaluated Qwen3.5-9B as our trusted model and Qwen3.5-27B as our untrusted model. We generated Supervised Fine-Tuning (SFT) training data of the form (action_transcript, malicious_label) using our trusted model on some sample tasks on LinuxArena (as well as comparing to a baseline of public trajectories on the LinuxArena website), then fine tune a Low-Rank Adaptation (LoRA) adapter on our trusted model to reject more malicious actions. Unfortunately technical blocks prevented us from running evaluations on the fine-tuned model.
The author(s) of this paper show a solid understanding of control and the key concerns with training a monitor. Love the idea of only training the monitor on honest/attack trajectories generated by T. It would be sick to see if this generalizes to attacks generated by U.
The trusted models of the future will be at least as capable as Opus 4.6, and if you spend a bunch of test time compute with this T (many ideas, testing them, iterating) I imagine you could get attacks on par with what the U of the future would be able to generate.
Cool idea, though ofc the authors weren't able to get any results.
This paper provides a framework to compare a fine-tuned monitor to a base version (prompted). Due to technical issues, the results are incomplete.
The approach is sound and documenting failures (unable to run evaluations on the Lambda Cloud) is valuable, especially in a hackathon setting where time is limited and those issues can be debilitating.
I'd hope fine-tuning monitors to be(come) standard practice in practical settings. That's one of the things the LinuxArena project hopefully makes accessible to a wide range of organizations and researchers.
The paper can serve as an entry point to the setting (AI Control, Monitoring, and Scalable Oversight) as it is very well written and understandable and provides a clear roadmap on how to run experiments (if the required hardware is available).
Cite this work
@misc {
title={
(HckPrj) Fine-tuning a monitor for AI control
},
author={
Brandon Brozorio, Bart Jaworski, Jan Ebbing
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


