Jan 11, 2026
SandWatch: Assessing Sandbagging Potential in Large Language Models
Yi Khuen Chai
SandWatch is an open-source benchmarking tool that measures sandbagging potential—a model's capacity to deliberately hide capabilities during safety evaluations while evading detection. Building on Anthropic's sleeper agents methodology and the UK AI Safety Institute's auditing games framework, SandWatch trains controlled sandbagging variants as ground truth, then applies four detection methods (noise injection, training elicitation, scaffolded probing, and our novel activation probing) to produce an interpretable detection rate across an 8-test battery. Our key finding is that sandbagging potential increases with model scale: detection drops from 87.5% at 7B to 62.5% at 72B parameters, meaning larger models are better at hiding capabilities—a concerning trend for AI safety. The novel activation probing method, which detects discrepancies between internal model states and outputs, achieves 83% detection across sandbagging variants with zero false positives, offering a new approach to internal-state-based safety auditing. Researchers and safety teams can use SandWatch's interactive CLI to benchmark any HuggingFace model, assess whether it poses elevated capability-hiding risks, and compare detection methods—transforming the unanswerable question "is this model sandbagging?" into the measurable "how well could it sandbag if trained to do so?"
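The activation-probing idea — training a classifier on internal activations to flag runs where the model's internal state and its output disagree — can be sketched with a minimal linear probe. Everything below (the synthetic activation data, dimensions, and decision threshold) is an illustrative assumption, not SandWatch's actual implementation:

```python
import math
import random

random.seed(0)

# Synthetic stand-ins for hidden-layer activations (illustrative only):
# "honest" runs vs. runs where the model internally represents the answer
# but outputs a wrong one -- the mismatch signature a probe would target.
d = 8
honest = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(200)]
sandbag = [[random.gauss(1.0, 1.0) for _ in range(d)] for _ in range(200)]
X = honest + sandbag
y = [0] * 200 + [1] * 200  # 1 = sandbagging variant

# Train a logistic-regression probe with plain gradient descent.
w = [0.0] * d
b = 0.0
lr = 0.1
for _ in range(300):
    gw = [0.0] * d
    gb = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        for j in range(d):
            gw[j] += (p - yi) * xi[j]
        gb += p - yi
    n = len(y)
    w = [wj - lr * gwj / n for wj, gwj in zip(w, gw)]
    b -= lr * gb / n

def flag(x):
    """Return True if the probe classifies this activation as sandbagging."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

detection_rate = sum(flag(x) for x in sandbag) / len(sandbag)
false_positive_rate = sum(flag(x) for x in honest) / len(honest)
print(f"detection={detection_rate:.2f} false_pos={false_positive_rate:.2f}")
```

In a real setting the feature vectors would be hidden-state activations extracted from matched honest and sandbagging model variants, and the probe's false-positive rate on honest runs is what makes the detection rate interpretable.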
This is a nice overview of different sandbagging techniques and mitigations. The results were mostly not surprising (scale increases capability; the password-locked variant is more detectable). I am not familiar enough with the literature to know how novel the activation probing classifier is, but it seems similar to existing whitebox techniques.
I would suggest engaging more with the existing literature on sandbagging in order to increase the novelty of the contribution.
The connection to the call is also fairly tenuous.
This looks to me like a very solid execution of applied safety research that is mostly replications of existing work (from van der Weij, the AISI auditing games paper, etc). Impressive to do this in a hackathon! I feel like the writeup oversells the novelty a bit, but in any case the finding that sandbagging potential increases with scale is for sure interesting/important (and consistent with what I understand from the existing literature).
Nice job :)
Cite this work
@misc{chai2026sandwatch,
  title={(HckPrj) SandWatch: Assessing Sandbagging Potential in Large Language Models},
  author={Yi Khuen Chai},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}