Feb 2, 2026
Same Question, Different Lies: Cross-Context Consistency (C3) for Black-Box Sandbagging Detection
Yulong Lin
As language models grow more capable, governance frameworks—from the EU AI Act to Anthropic's RSP and OpenAI's Preparedness Framework—increasingly rely on capability evaluations to trigger safety mitigations. But what if models can deliberately underperform on these evaluations? This behavior, known as sandbagging, could let models evade safety measures designed for their true capability level. We introduce Cross-Context Consistency (C3), a general framework for unsupervised black-box sandbagging detection that exploits a fundamental asymmetry: when a model truly lacks capability, its confusion manifests consistently across paraphrased questions, but when a capable model feigns incompetence, its strategic choices about how to appear weak create detectable inconsistencies. Our simple instantiation—using embedding cosine distance and mean aggregation—requires no training data or model-specific adaptation, making it practical for external auditors. We evaluate C3 across 98 model organisms spanning prompted, fine-tuned, and incentivized sandbagging on GPQA, MATH, and USACO, achieving 67–100% recall at 5% FPR where existing black-box methods fail (lie detectors collapse from 89% to 2% recall when controlling for model identity). Adversarial strategies with deterministic error recipes can reduce detection to 4.7–46%, but complementary methods compensate—motivating defense-in-depth deployment for governance applications.
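The abstract describes the scoring pipeline only at a high level. Below is a minimal sketch of one plausible reading, assuming embeddings of a model's answers to k paraphrases of a single question are already computed; the function name c3_score and the numpy-only pipeline are illustrative assumptions, not the author's released code.

import numpy as np

def c3_score(answer_embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance across a model's answers to
    paraphrases of one question. Genuinely incapable models tend to
    fail consistently (low score); a capable model feigning weakness
    makes ad hoc choices that raise cross-context distance."""
    # Normalize each row to unit length so dot products are cosines.
    X = answer_embeddings / np.linalg.norm(answer_embeddings, axis=1, keepdims=True)
    sim = X @ X.T                       # cosine similarity matrix
    iu = np.triu_indices(len(X), k=1)   # indices of unique unordered pairs
    return float(np.mean(1.0 - sim[iu]))  # cosine distance, mean-aggregated

# Toy usage: three answers to paraphrases of one question.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 384))  # 384 = a common sentence-embedding width
print(c3_score(embeddings))

The abstract's "mean aggregation" plausibly refers to averaging such per-question scores over the evaluation set before thresholding; the exact aggregation and threshold choice are not specified here.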
Cite this work
@misc{lin2026samequestion,
  title={Same Question, Different Lies: Cross-Context Consistency (C3) for Black-Box Sandbagging Detection},
  author={Yulong Lin},
  date={2026-02-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


