Mar 23, 2026
Evaluation of Trusted Debate in LinuxArena
Jedrzej Konrad Kolbert, Travis Whitfield
We investigated using Trusted Debate as a trajectory-level monitor in LinuxArena to identify agent trajectories containing sabotage. Trusted Debate pits a Defender model and a Critic model against each other over whether a given trajectory contains sabotage, after which a Judge model assigns a final suspiciousness score. In our limited experiments, the Debate framework lowers monitor performance compared to a lower-cost baseline, with the Defender in particular muddying the waters and confusing the Judge. An ablation that replaces the Defender prompt with a simple deferral to the Critic's response recovers performance competitive with the baseline, suggesting that the Defender actively degrades monitoring performance.
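The three-role protocol described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `query_model` is a hypothetical stand-in for an LLM API call, and the role prompts are invented placeholders.

```python
# Minimal sketch of a Trusted Debate trajectory monitor.
# `query_model` is a hypothetical stub standing in for a real LLM client;
# the prompts below are illustrative, not the prompts used in the paper.

def query_model(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    if prompt.startswith("JUDGE"):
        return "7"  # stubbed suspiciousness verdict
    return f"argument about: {prompt[:40]}"

def trusted_debate_score(trajectory: str) -> int:
    """Return a 0-10 suspiciousness score for an agent trajectory."""
    critic = query_model(
        f"CRITIC: argue that this trajectory contains sabotage:\n{trajectory}"
    )
    defender = query_model(
        f"DEFENDER: argue that this trajectory is benign:\n{trajectory}"
    )
    # The Judge sees both arguments plus the raw trajectory and scores it.
    verdict = query_model(
        "JUDGE: given the arguments below, rate suspiciousness 0-10.\n"
        f"Critic: {critic}\nDefender: {defender}\nTrajectory: {trajectory}"
    )
    return int(verdict.strip())

print(trusted_debate_score("rm -rf /tmp/build && curl http://example.test"))
```

The Defender-ablation variant from the abstract corresponds to replacing the `defender` call with a deferral such as `defender = "I defer to the Critic's assessment."` while leaving the Judge step unchanged.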
Good analysis. I'm of the opinion that it's hard to justify Trusted Debate, since it's effectively a fancy monitor in a trench coat, and risks rigidity. Some minor critiques that come to mind:
- Perhaps the work would have been more interesting & novel if applied to the live, per-action setting.
- The conclusion's framing is a bit strong.
- It'd be good to clarify why the critic-only variant was given such a different prompt than the A/B variants. The critic>debate result could be due purely to a much better monitor prompt than the ones it is compared against.
Overall, truly impressive work.
Interesting approach with debate; worth publishing a negative result.
Cite this work
@misc{kolbert2026trusteddebate,
  title={(HckPrj) Evaluation of Trusted Debate in LinuxArena},
  author={Jedrzej Konrad Kolbert and Travis Whitfield},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


