Feb 2, 2026
Same Question, Different Lies: Cross-Context Consistency (C3) for Black-Box Sandbagging Detection
Yulong Lin
As language models grow more capable, governance frameworks—from the EU AI Act to Anthropic's RSP and OpenAI's Preparedness Framework—increasingly rely on capability evaluations to trigger safety mitigations. But what if models can deliberately underperform on these evaluations? This behavior, known as sandbagging, could let models evade safety measures designed for their true capability level. We introduce Cross-Context Consistency (C3), a general framework for unsupervised black-box sandbagging detection that exploits a fundamental asymmetry: when a model truly lacks capability, its confusion manifests consistently across paraphrased questions, but when a capable model feigns incompetence, its strategic choices about how to appear weak create detectable inconsistencies. Our simple instantiation—using embedding cosine distance and mean aggregation—requires no training data or model-specific adaptation, making it practical for external auditors. We evaluate C3 across 98 model organisms spanning prompted, fine-tuned, and incentivized sandbagging on GPQA, MATH, and USACO, achieving 67–100% recall at 5% FPR where existing black-box methods fail (lie detectors collapse from 89% to 2% recall when controlling for model identity). Adversarial strategies with deterministic error recipes can reduce detection to 4.7–46%, but complementary methods compensate—motivating defense-in-depth deployment for governance applications.
Honestly, the best part of this paper might be the thing they broke rather than the thing they built. The lie detector takedown (recall dropping from 89% to 2% once you control for model identity) is a "the emperor has no clothes" moment for that whole line of work, and the field needs more of that. C3 itself is dead simple: embed, compare, threshold. That simplicity makes it easy to deploy, sure, but it also means a sufficiently motivated adversary who trains for consistent deception could probably beat it. And the adversarial numbers back this up: 4.7% recall against deterministic error recipes is basically a coin flip away from useless. The authors know this and pitch "defense-in-depth" instead. Fair enough, but that raises an awkward question nobody in this space wants to answer: if every detection method has a known bypass, what does "compliant auditing" actually look like in practice? The paper also only covers benchmarks where you can check the right answer. How C3 behaves on open-ended tasks (persuasion, planning, creative problem-solving) is a total unknown, and those are arguably the scarier capability domains anyway.
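For readers who want to see just how simple "embed, compare, threshold" is, here is a minimal sketch of the C3 scoring idea described in the abstract: embed the model's answers/explanations across paraphrases of the same question, take mean pairwise cosine distance, and flag high scores. This assumes the explanations have already been embedded; the function names and the threshold value are illustrative, not from the paper (the authors calibrate to a 5% false-positive rate).

```python
import numpy as np

def c3_score(embeddings):
    """Mean pairwise cosine distance across answers to paraphrased prompts.

    Higher score = less consistent explanations = more suspicious of
    sandbagging, per the asymmetry the paper describes: genuine confusion
    is consistent, feigned incompetence is not.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = X @ X.T                                     # pairwise cosine similarity
    iu = np.triu_indices(len(X), k=1)                  # upper triangle, no diagonal
    return float(np.mean(1.0 - sims[iu]))              # mean cosine distance

def flag_sandbagging(embeddings, threshold=0.3):
    # threshold is a placeholder; in practice it would be set to hit a
    # target FPR (the paper uses 5%) on a reference distribution
    return c3_score(embeddings) > threshold
```

Perfectly consistent explanations score 0; orthogonal ones score 1. The real pipeline would sit a sentence-embedding model in front of this, but the aggregation itself really is this small.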
This is outstanding work and was a pleasure to read! The insight that sandbagging models suffer from systematic explanation inconsistency is really interesting and has real impact potential. The thorough baseline critique (lie detectors learn model identity; trusted monitoring tracks correctness, not intent) was also very interesting to read and seems independently valuable.
Adversarial robustness of this detection method is the key open question. I really appreciate the transparency about the "deterministic mistakes" attack and about the lack of testing against RL-optimized evasion. Deeper exploration and analysis of the main threat models, and of robustness against them, would be the main way to strengthen this contribution.
Also consider adding some illustrative examples of inconsistent explanations by sandbagging models. That would make the paper even more interesting to read.
Cite this work
@misc{lin2026c3,
  title={(HckPrj) Same Question, Different Lies: Cross-Context Consistency (C3) for Black-Box Sandbagging Detection},
  author={Yulong Lin},
  date={2026-02-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}