Mar 22, 2026
Sandbagging is an Absence Signal
bigsnarfdude
OpenAI productionized text based sandbagging detection in CoT monitoring and claim they are covered. We built a naive text monitor and found it catches 8% of sandbagging. Basic Haiku, Sonnet and Opus sandbagging CoT classification monitors score under 10%. Text Experiments showed that detailed evaluation of thoughts on the models plan to sandbagging boost detection to 84%. Our LinuxArena multi-agent prompt optimization swarms detect increase detections 87.8% F1. Then we built mechanistic interpretation based activation probes that hit 98.9% AUROC and watched them collapse to coin-flip accuracy when trying to build a generalized probe detection system across various open source models. Sandbagging is an absence signal. It’s model dependent and there isn’t a universal pattern. Detection can be strong if you have access to the weights and models directly. A combination of text and activation probes are first line defenses for chasing the sandbagging detection problem. Labs running their own models can self-probe at inference time and should. We built the dataset (3,185 traces, 3 model families), found the prompts, built the pipeline, and quantified the gap. Everything released open-source.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) Sandbagging is an Absence Signal
},
author={
bigsnarfdude
},
date={
3/22/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


