Jan 11, 2026
Sandbagging Detection via Static Analysis
Samuel Gelineau, Divyansh Goyal
Our sandbagging-detection algorithm is mathematically guaranteed to never label a model as honest if it is actually sandbagging.
This project presents a formal, provably conservative method for detecting sandbagging in neural networks. The combination of static analysis and witness-input search either demonstrates honesty or flags potential sandbagging in small, fully known models, which is impressive for a hackathon project.
The main limitation is that the approach is currently restricted to toy networks, so it does not yet scale to real LLMs or complex capabilities. Including examples of how the method might approximate larger models or handle partial knowledge would improve practical relevance. Overall, this is a well-executed theoretical contribution with strong potential for future extensions.
Cite this work
@misc{
  title={(HckPrj) Sandbagging Detection via Static Analysis},
  author={Samuel Gelineau and Divyansh Goyal},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


