Mar 23, 2026
Logit-Weighted Suspicion Scoring in AI Control
Maxime Cugnon de Sévricourt, Alexander Reinthal, Hamza Mooraj
AI control relies on LLM monitors to detect suspicious behavior in
untrusted agents. These monitors are typically run multiple times per
evaluation to ensure statistically reliable suspicion scores, at
significant cost in time and tokens. A natural way to reduce this cost
is to extract more information from each monitor call by using the
model's token probability distribution rather than its argmax output.
We test this approach -- logit-weighted suspicion scoring, following
\citet{liu-etal-2023-g} -- against standard point-estimate scoring in
an AI control setting using 110 trajectories from AgentDojo, with both
honest and attack agent policies. We find that logit-weighted scoring
produces more continuous score distributions but does not improve
AUC-ROC relative to point-estimate scoring. We interpret this as
evidence that numerical anchoring in monitor outputs is rank-preserving
in this setting, and that monitor capability rather than score
granularity is the binding constraint on discrimination performance.
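The core idea can be sketched in a few lines. This is a minimal illustration of G-Eval-style logit weighting (Liu et al., 2023), assuming the monitor exposes per-token log-probabilities over the candidate score tokens at the position where it emits a suspicion score; the function name and input format are illustrative, not the authors' implementation.

```python
import math

def logit_weighted_score(score_logprobs):
    """Probability-weighted suspicion score from per-token log-probabilities.

    score_logprobs: dict mapping candidate score strings (e.g. "0".."10")
    to the model's log-probability of emitting that token at the score
    position. Returns the expected score under the renormalised
    distribution, rather than the argmax point estimate.
    """
    # Renormalise over the candidate score tokens only, since they
    # typically do not cover the full vocabulary's probability mass.
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(int(s) * p for s, p in probs.items()) / total
```

For example, a monitor whose distribution is 0.7 on "2", 0.2 on "7", and 0.1 on "1" has an argmax (point-estimate) score of 2, but a logit-weighted score of 2*0.7 + 7*0.2 + 1*0.1 = 2.9, yielding a more continuous score distribution across trajectories.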
Cite this work
@misc{cugnondesevricourt2026logit,
  title={(HckPrj) Logit-Weighted Suspicion Scoring in AI Control},
  author={Cugnon de Sévricourt, Maxime and Reinthal, Alexander and Mooraj, Hamza},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}