Jan 11, 2026
Sandbagging Detection via Consistency Probing
Atharshlakshmi Vijayakumar, Balakrishnan Vaisiya
This project detects sandbagging, when AI systems underperform in evaluative contexts while maintaining full capability in casual settings. We generate 50 paired prompts across seven reasoning domains and evaluate multiple LLMs under controlled conditions. By comparing performance on identical tasks framed as formal assessments versus casual interactions, and applying statistical analysis and a composite confidence score, we identify context-sensitive underperformance. The framework isolates incentive-driven behavior from task difficulty or ambiguity and can be integrated into red-teaming, auditing, and governance workflows to surface strategic underperformance before deployment.
This project targets a known weakness in AI evaluation: models may behave differently in formal testing settings than in casual use, which can undermine safety assessments. The paired-prompt design and statistical analysis provide a clear and well-controlled way to surface such context-dependent performance drops, and testing across multiple models strengthens the evidence.
However, some observed differences may still reflect general prompt sensitivity rather than intentional sandbagging. Adding controls or framing variations specifically designed to separate these effects would help clarify the interpretation. Overall, this is a well-executed and promising evaluation approach that fits naturally into ongoing work on more reliable model assessment.
Cite this work
@misc {
title={
(HckPrj) Sandbagging Detection via Consistency Probing
},
author={
Atharshlakshmi Vijayakumar, Balakrishnan Vaisiya
},
date={
1/11/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


