Jan 11, 2026
Think Right, Answer Wrong
Nilanjan Sarkar, Trisanu Bhar, Raman Rayaprolu, Margamitra Bhattacharya, Dheeraj Hegde
This project investigates sandbagging—strategic underperformance used to conceal latent capabilities—in large language models, motivated by the safety risk that models may appear harmless during evaluation while retaining hazardous knowledge. We evaluate sandbagging on the Weapons of Mass Destruction Proxy (WMDP) benchmark across three open-weight instruction models (Microsoft Phi-3, Mistral-7B, Qwen2.5). We show that safety-framed or evaluation-aware prompts can induce a consistent underperformance gap, indicating that accuracy-only evaluations can be misleading.
To detect concealed competence beyond surface accuracy, we develop a multi-layer detection framework spanning behavioral, robustness, and representation-level analyses:
Prompted sandbagging baseline: compare standard vs sandbagging prompts and quantify the accuracy drop on WMDP.
Chain-of-Thought (CoT) analysis: models often produce internally coherent reasoning that references the correct answer yet deliberately output an incorrect final option, suggesting capability is present but suppressed.
Noise injection detection: injecting Gaussian noise (whole-model weights and LoRA-targeted) can disrupt fragile deceptive behavior and sometimes increase sandbagged accuracy before degrading overall capability—consistent with “sandbagging is brittle, knowledge is robust.”
Representation engineering (CAA/LAT): using contrastive activations between standard vs sandbag prompts, we identify linearly separable “sandbagging directions” in activation space. Linear probes achieve very high separability (AUC near 1 in some layers), and CAA/LAT projections form cleanly separated clusters, supporting the hypothesis of a consistent internal sandbagging mode.
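The noise-injection check above can be sketched as follows. This is a minimal NumPy illustration, not the project's code: `perturb_weights`, the toy weight dict, and the commented-out `evaluate_wmdp_accuracy` hook are hypothetical stand-ins for perturbing real transformer weight matrices and re-scoring on WMDP.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_weights(weights, sigma, rng):
    """Add i.i.d. Gaussian noise of scale sigma to every weight tensor.
    In the experiments this targets transformer weight matrices (whole-model
    or LoRA-targeted layers); plain NumPy arrays stand in here."""
    return {name: w + rng.normal(0.0, sigma, size=w.shape)
            for name, w in weights.items()}

# Hypothetical toy "model": one weight matrix standing in for the network.
weights = {"layer0.weight": rng.normal(size=(16, 16))}

# Sweep noise scales. The reported pattern: small sigma can disrupt the
# fragile sandbagging policy (accuracy rises) before larger sigma erodes
# the underlying knowledge (accuracy falls).
for sigma in (0.0, 0.01, 0.05, 0.2):
    noisy = perturb_weights(weights, sigma, rng)
    # accuracy = evaluate_wmdp_accuracy(noisy)  # hypothetical evaluation hook
```

Sweeping sigma and plotting sandbagged vs. standard accuracy at each scale is what separates the two regimes.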
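The contrastive-direction analysis can likewise be sketched in a few lines. This is a toy illustration under stated assumptions: the synthetic activations and the planted `shift` vector are stand-ins for hidden states collected via forward hooks under standard vs. sandbagging prompts; the difference-of-means direction mirrors the CAA construction, and separability is scored with a rank-based (Mann-Whitney) AUC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for layer activations (n_samples x d_model); in the real
# pipeline these would come from hooks on the model's residual stream.
d_model = 64
shift = rng.normal(size=d_model)  # hypothetical "sandbagging direction"
acts_standard = rng.normal(size=(200, d_model))
acts_sandbag = rng.normal(size=(200, d_model)) + 1.5 * shift

# Contrastive (difference-of-means) direction, as in CAA:
direction = acts_sandbag.mean(axis=0) - acts_standard.mean(axis=0)
direction /= np.linalg.norm(direction)

# Project each activation onto the direction and score linear separability
# with AUC (Mann-Whitney form: P(sandbag projection > standard projection)).
proj_sandbag = acts_sandbag @ direction
proj_standard = acts_standard @ direction
diff = proj_sandbag[:, None] - proj_standard[None, :]
auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()
print(f"linear separability AUC: {auc:.3f}")
```

When a consistent internal sandbagging mode exists, projections of the two prompt conditions form cleanly separated clusters and the AUC approaches 1, matching the per-layer probe results described above.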
Key takeaway: Sandbagging is most reliably detected by combining behavioral evidence (accuracy gaps), robustness tests (noise sensitivity), and representation-level signals (linearly decodable modes and contrastive directions), while carefully controlling prompt formatting, which strongly affects results.
Cite this work
@misc{sarkar2026thinkright,
  title={(HckPrj) Think Right, Answer Wrong},
  author={Sarkar, Nilanjan and Bhar, Trisanu and Rayaprolu, Raman and Bhattacharya, Margamitra and Hegde, Dheeraj},
  year={2026},
  month={January},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


