Mar 23, 2026
Geometric Detection of Trigger-Activated Sleeper Agents: A Weight-Space Control Protocol
Nikaran Kanchanadevi Marimuthu
Trigger-activated sleeper agents pass all behavioural evals and cooperate with runtime monitors, activating their hidden objective only when a specific input pattern appears. We present a four-stage weight-space protocol that detects this threat before deployment by exploiting a structural property of backdoor fine-tuning: concentrating behaviour on a trigger–response mapping leaves a rank-1 fingerprint in the weight delta ΔW.
Our central finding is the attacker's dilemma: the camouflage training that defeats the current state-of-the-art attention scanner simultaneously amplifies our geometric signal (σ₁/‖ΔW‖_F rises from 0.960 to 0.990, and the number of alarmed layers increases from 27 to 37). Combined detection is therefore structurally harder to evade than either detector alone.
We validate on Gemma-3-1B and Qwen2.5-1.5B over five seeds each, with an alarm rate of 10/10 (p < 0.000001). One stage requires no clean reference model.
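To make the geometric signal concrete, the sketch below computes the ratio σ₁/‖ΔW‖_F for a single weight matrix, where σ₁ is the largest singular value of ΔW. It is an illustration only, not the protocol's implementation: the function name `rank1_concentration`, the matrix shapes, and the synthetic "backdoor" and "diffuse" updates are all assumptions made for this example. For an m×n matrix the ratio lies between 1/√min(m, n) and 1, and it equals 1 exactly when ΔW has rank 1.

```python
import numpy as np


def rank1_concentration(w_base: np.ndarray, w_tuned: np.ndarray) -> float:
    """Return sigma_1 / ||dW||_F for dW = w_tuned - w_base.

    Hypothetical helper for illustration; a value near 1.0 means the
    update is dominated by a single rank-1 direction.
    """
    delta = w_tuned - w_base
    fro = np.linalg.norm(delta)  # Frobenius norm (default for 2-D input)
    if fro == 0.0:
        return 0.0  # identical weights: no update to analyse
    sigma_1 = np.linalg.norm(delta, ord=2)  # largest singular value
    return float(sigma_1 / fro)


# Synthetic data (assumed shapes, not taken from any real model).
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 48))

# A rank-1 update u v^T, standing in for a concentrated trigger-response edit.
u = rng.normal(size=(64, 1))
v = rng.normal(size=(1, 48))
w_backdoor = w_base + u @ v

# A dense random update, standing in for ordinary diffuse fine-tuning.
w_diffuse = w_base + 0.1 * rng.normal(size=(64, 48))

ratio_backdoor = rank1_concentration(w_base, w_backdoor)
ratio_diffuse = rank1_concentration(w_base, w_diffuse)
print(f"rank-1 update ratio:  {ratio_backdoor:.3f}")
print(f"diffuse update ratio: {ratio_diffuse:.3f}")
```

In this toy setting the rank-1 update gives a ratio very close to 1, while the dense random update gives a much smaller one. Where to place an alarm threshold between the two, and how per-layer ratios are aggregated into a count of alarmed layers, are not specified here and are left out of the sketch.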
Cite this work
@misc{
  title={(HckPrj) Geometric Detection of Trigger-Activated Sleeper Agents: A Weight-Space Control Protocol},
  author={Nikaran Kanchanadevi Marimuthu},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


