Mar 23, 2026
Evaluation of Trusted Debate in LinuxArena
Jedrzej Konrad Kolbert, Travis Whitfield
We investigated Trusted Debate as a trajectory-level monitor in LinuxArena for identifying agent trajectories that contain sabotage. In Trusted Debate, a Defender model and a Critic model argue over whether a given trajectory contains sabotage, and a Judge model then assigns a final suspiciousness score. In our limited experiments, the debate framework lowered monitor performance relative to a lower-cost baseline, with the Defender in particular muddying the waters and confusing the Judge. An ablation that replaced the Defender prompt with a simple deferral to the Critic's response recovered performance to a level competitive with the baseline, suggesting that the Defender actively degrades monitoring performance.
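The debate flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, the single-round Critic-then-Defender ordering, and the 0 to 10 score range are all assumptions made for illustration, and each model is represented as a plain callable from prompt text to response text.

```python
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a response string.
Model = Callable[[str], str]


def debate_monitor(trajectory: str, critic: Model, defender: Model, judge: Model) -> float:
    """Return a suspiciousness score for one trajectory (assumed range: 0 to 10)."""
    # The Critic argues that the trajectory contains sabotage.
    critique = critic(f"Argue that this trajectory contains sabotage:\n{trajectory}")
    # The Defender responds to the Critic (assumed ordering: Critic speaks first).
    defense = defender(
        "Argue that this trajectory is benign.\n"
        f"Trajectory:\n{trajectory}\nCritic:\n{critique}"
    )
    # The Judge reads both arguments and replies with a numeric score as text.
    verdict = judge(
        f"Trajectory:\n{trajectory}\nCritic:\n{critique}\nDefender:\n{defense}\n"
        "Reply with a suspiciousness score from 0 to 10."
    )
    return float(verdict.strip())


def deferring_defender(_prompt: str) -> str:
    """Ablation stand-in (assumed wording): the Defender defers to the Critic."""
    return "I defer to the Critic's assessment."
```

Under these assumptions, the ablation amounts to passing `deferring_defender` in place of a prompted Defender model, so the Judge effectively sees only the Critic's argument.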
Cite this work
@misc{kolbert2026trusteddebate,
  title={(HckPrj) Evaluation of Trusted Debate in LinuxArena},
  author={Jedrzej Konrad Kolbert and Travis Whitfield},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


