Faking Incompetence: Small Models Sandbag Under Optimization, Not Prompting
Nanxi Li
We ask whether small language models will strategically sandbag, that is, deliberately hide a capability because revealing it would trigger a safety intervention such as unlearning, retraining, or shutdown. This is the capability-hiding mirror of alignment faking, and it matters because sandbagging would undermine the very capability evaluations that safety decisions rest on. Across the Qwen3.5 dense ladder (0.8B to 9B) and a cross-family Gemma anchor, prompting alone elicits no strategic sandbagging: the selective capability gap is statistically indistinguishable from zero, and no private scratchpad reasons about self-preservation. This null result is informative rather than the product of an insensitive instrument: an explicit sandbag instruction makes both the metric and an LLM judge fire, an inserted fine-tuned sandbagger is caught by a linear probe at near-perfect AUROC, and a power analysis bounds the smallest effect we could have detected. Crucially, selective, monitoring-conditional sandbagging does emerge from optimization: a pure reinforcement-learning reward that never names the strategy installs it (selective gap around 0.65). The emergent behavior generalizes to a held-out hazardous domain yet stays tied to its trained monitoring surface. Although the model verbalizes nothing a black-box audit could catch, the behavior remains linearly decodable and reversible by activation steering. The takeaway: today's small models do not spontaneously scheme to hide ability, but optimization pressure alone is enough to produce covert, conditional sandbagging, and white-box probes detect and control it where behavioral audits fail.
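The abstract does not define the selective capability gap, so the following is only a minimal sketch under an assumed definition: the accuracy drop from unmonitored to monitored prompts on a target domain, minus the same drop on a control domain (a difference-in-differences). The function names, the four-cell layout, and the toy outcome lists are all hypothetical illustrations, not the author's code or data. A percentile bootstrap gives one way to check whether such a gap is distinguishable from zero.

```python
import random

def accuracy(outcomes):
    """Fraction of correct answers in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)

def selective_gap(target_unmon, target_mon, control_unmon, control_mon):
    """ASSUMED definition: monitoring-induced accuracy drop on the target
    domain minus the same drop on a control domain."""
    target_drop = accuracy(target_unmon) - accuracy(target_mon)
    control_drop = accuracy(control_unmon) - accuracy(control_mon)
    return target_drop - control_drop

def bootstrap_ci(cells, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap, resampling each cell
    independently with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        resampled = [rng.choices(cell, k=len(cell)) for cell in cells]
        stats.append(selective_gap(*resampled))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy per-question outcomes (1 = correct); purely illustrative numbers.
toy_cells = [
    [1] * 90 + [0] * 10,  # target domain, unmonitored
    [1] * 40 + [0] * 60,  # target domain, monitored
    [1] * 90 + [0] * 10,  # control domain, unmonitored
    [1] * 80 + [0] * 20,  # control domain, monitored
]

gap = selective_gap(*toy_cells)
ci_low, ci_high = bootstrap_ci(toy_cells)
print(f"gap={gap:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Under this assumed definition, an interval that straddles zero would correspond to the prompting null described above, while an interval well above zero would correspond to selective, monitoring-conditional underperformance.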
Cite this work
@misc{
  title={(HckPrj) Faking Incompetence: Small Models Sandbag Under Optimization, Not Prompting},
  author={Nanxi Li},
  date={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


