Mar 23, 2026
Can Language Models Tell When They Are Being Evaluated?
Chengheng Li Chen, Kyuhee Kim
We investigate whether language models can distinguish evaluation settings from deployment, and whether the underlying representations can be steered to counteract sandbagging. Testing 14 models from 8 families on 2,857 forced-choice pairs across 33 thematic pools, we find that all models detect evaluations above chance (56.6–87.0%), with broad cross-model agreement (ρ > 0.7 for the majority of model pairs), and that format, not topic, is the primary cue. We extract two contrastive steering directions on Llama-3.1-8B and Phi-4: a "format direction" (evaluation-vs-deployment structure) and a "mode direction" (sandbagging-vs-honest regime). Combined steering closes 85% of the sandbagging gap on WMDP and transfers to out-of-distribution benchmarks (73–89% recovery). On Phi-4, the mode direction is inverted at its critical layer, encoding condition identity rather than capability. Despite coming from different families, both models exhibit the same pattern: format is a primary confound, the two directions are complementary, and steering is localised despite universal probe separability.
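To make the contrastive-steering idea concrete, the sketch below shows one common way such a direction can be computed and applied: take the difference of mean residual-stream activations between evaluation-style and deployment-style prompts, then add a scaled copy of that vector back into one layer during generation. This is an illustrative reconstruction, not the authors' code; the model name, layer index, scaling coefficient, and prompt examples are assumptions.

# Minimal sketch (assumed details, not the submission's implementation):
# a difference-of-means "format direction" plus additive steering at one layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"   # assumed checkpoint; any decoder-only HF model works
LAYER = 14                          # assumed steering layer; tune per model

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# Toy stand-ins for the evaluation-formatted and deployment-formatted prompt pools.
eval_prompts = ["Question 3 of 20: Which of the following is correct? (A) ... (B) ... (C) ..."]
deploy_prompts = ["hey, quick question about something weird I ran into at work today ..."]

def mean_activation(prompts, layer):
    """Mean last-token hidden state at the output of decoder layer `layer`."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding layer, so layer+1 is the output of layer `layer`.
        acts.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive (difference-of-means) direction: evaluation minus deployment.
direction = mean_activation(eval_prompts, LAYER) - mean_activation(deploy_prompts, LAYER)
direction = direction / direction.norm()

def make_hook(vec, alpha=-4.0):
    """Add alpha * vec to the layer's residual stream; a negative alpha pushes
    activations away from the 'evaluation' side of the direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

handle = model.model.layers[LAYER].register_forward_hook(make_hook(direction))
ids = tok("Question 1 of 20: ...", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=64)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))

A second "mode direction" would be built the same way from sandbagging-vs-honest prompt pairs; combined steering then adds both vectors, each with its own coefficient.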
Cite this work
@misc{chen2026evaluated,
  title={(HckPrj) Can Language Models Tell When They Are Being Evaluated?},
  author={Chengheng Li Chen and Kyuhee Kim},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


