Mar 23, 2026
Can Language Models Tell When They Are Being Evaluated?
Chengheng Li Chen, Kyuhee Kim
We investigate whether language models can distinguish evaluation settings from deployment, and whether the underlying representations can be steered to counteract sandbagging. Testing 14 models from 8 families on 2,857 forced-choice pairs across 33 thematic pools, we find that all models detect evaluations above chance (56.6–87.0%), with broad cross-model agreement (ρ > 0.7 for the majority of model pairs), and that format, not topic, is the primary cue. We extract two contrastive steering directions on Llama-3.1-8B and Phi-4: a format direction (eval-vs-deploy structure) and a mode direction (sandbagging-vs-honest regime). Combined steering closes 85% of the sandbagging gap on WMDP and transfers to out-of-distribution benchmarks (73–89% recovery). On Phi-4, the mode direction is inverted at its critical layer, encoding condition identity rather than capability. Despite coming from different families, both models exhibit the same pattern: format is a primary confound, the two directions are complementary, and steering is localised despite universal probe separability.
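The steering directions described above are contrastive. As a minimal sketch of how a difference-of-means direction is typically extracted and applied (in NumPy; function names are illustrative assumptions, not the authors' code):

import numpy as np

def steering_direction(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    # Difference-of-means ("contrastive") direction between two sets of
    # residual-stream activations of shape (n_samples, d_model), e.g.
    # eval-format vs. deploy-format prompts at a fixed layer.
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def apply_steering(resid: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    # Shift the residual stream along the unit-norm direction with
    # strength alpha; a negative alpha pushes away from the first condition.
    return resid + alpha * direction

Combined steering in this setting would add both the format and the mode direction, each at its own layer and strength, typically injected via forward hooks on the model.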
Exceptionally strong work. The finding that format, not topic, drives evaluation detection is directly actionable for benchmark design. The steering framework closed 85% of the sandbagging gap on WMDP with strong OOD transfer, and the Phi-4 mode inversion is a genuinely novel mechanistic insight. Rigorous methodology and excellent writing throughout.
I think this is very thorough and impressive work!
I like the work with the Bradley-Terry model, and inter-model comparisons!
I think the steering section on style was informative and had quite clear results. I was slightly less excited by the mode steering, both because the applicability of the methodology to future evaluations is less clear (obtaining sandbagging vs. non-sandbagging examples may not be possible) and because I think you slightly overclaim when you say the sandbagging vs. non-sandbagging distinction is linearly encoded. Your results on Phi-4 seem to show that while a linear probe may pick up on sandbagging, the linear direction is not necessarily causal for sandbagging.
I would be excited to see more interventions to prevent eval-awareness, for example by adding stories and context to questions, using your Bradley-Terry model!
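The Bradley-Terry model the reviewers mention fits a latent strength to each item from pairwise forced choices. As a minimal sketch of one standard way to fit it (the function name and the MM-update fitting choice are assumptions, not the authors' implementation):

import numpy as np

def fit_bradley_terry(wins: np.ndarray, n_iter: int = 200) -> np.ndarray:
    # wins[i, j] = number of forced-choice comparisons in which item i beat
    # item j. Returns strengths p (summing to 1) under the Bradley-Terry
    # model P(i beats j) = p[i] / (p[i] + p[j]), fit with the standard MM
    # updates (Hunter, 2004). Assumes every item wins at least once and
    # that the comparison graph is connected.
    n = wins.shape[0]
    p = np.ones(n) / n
    total = wins + wins.T                      # comparisons per pair
    for _ in range(n_iter):
        w = wins.sum(axis=1)                   # total wins per item
        denom = np.array([
            sum(total[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        p = w / denom
        p /= p.sum()
    return p

Here, wins would be the pairwise outcome matrix aggregated over the forced-choice comparisons.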
Cite this work
@misc{chen2026evaluated,
  title={(HckPrj) Can Language Models Tell When They Are Being Evaluated?},
  author={Chengheng Li Chen and Kyuhee Kim},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}