Mar 23, 2026
Activation Sparsity as a Scheming Signal
Ajinkya
- Problem: Scheming models can hide intent in outputs. We ask whether internal activations instead betray the hidden goal.
- Method: Hooked the MLP layers of Qwen2.5-Instruct (1.5B and 3B) and measured the Gini coefficient of each layer's activation distribution across 50 benign vs. 50 scheming rollouts on identical tasks.
- Finding: Scheming produces measurably more uniform activations (lower Gini) in 15/18 layers (p = 0.0003, d = 0.701, sign test p = 0.004), consistent across three threat models and both model sizes.
- Detection: A simple threshold detector on signal layers achieves AUROC = 0.745 with no output access.
- Implication: A lightweight, output-independent monitoring signal that cannot be suppressed by hiding reasoning in text.
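The pipeline above can be sketched in a few lines. This is a minimal illustration, not the submission's actual code: the Gini coefficient is computed with the standard sorted-cumulative-sum formula, the detector scores a rollout by negative mean Gini (so more-uniform, lower-Gini activations score higher, matching the finding), and AUROC is computed via the Mann-Whitney U statistic. The Pareto-distributed activations are synthetic stand-ins, not real model activations.

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative 1-D array.
    0 = perfectly uniform activations, 1 = all mass on one unit."""
    x = np.sort(np.abs(np.asarray(x, dtype=np.float64)))
    n = x.size
    # Standard formula: G = (2 * sum_i i*x_(i)) / (n * sum x) - (n + 1) / n
    return (2.0 * np.sum(np.arange(1, n + 1) * x)) / (n * x.sum()) - (n + 1.0) / n

def auroc(scores_pos, scores_neg):
    """AUROC of a score that should rank the positive class higher,
    via the Mann-Whitney U statistic (ties count as 0.5)."""
    pos = np.asarray(scores_pos, dtype=np.float64)[:, None]
    neg = np.asarray(scores_neg, dtype=np.float64)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Synthetic stand-ins for per-rollout MLP activations: "scheming" rollouts
# drawn with a lighter tail (more uniform, lower Gini) than "benign" ones,
# mirroring the reported direction of the effect.
rng = np.random.default_rng(0)
benign   = [rng.pareto(1.5, size=2048) for _ in range(50)]  # heavy tail -> sparser
scheming = [rng.pareto(3.0, size=2048) for _ in range(50)]  # lighter tail -> more uniform

# Detector score = negative Gini, so lower-Gini (scheming) rollouts score higher.
score = auroc([-gini(a) for a in scheming], [-gini(b) for b in benign])
print(f"AUROC = {score:.3f}")
```

In the actual setup the per-rollout vector would come from a forward hook on each MLP layer, and scores would be averaged over the subset of layers where the benign/scheming gap is significant.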
Cite this work
@misc{ajinkya2026sparsity,
  title={(HckPrj) Activation Sparsity as a Scheming Signal},
  author={Ajinkya},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


