Mar 23, 2026
Geometric Detection of Trigger-Activated Sleeper Agents: A Weight-Space Control Protocol
Nikaran Kanchanadevi Marimuthu
Trigger-activated sleeper agents pass all behavioural evals and cooperate with runtime monitors — activating their hidden objective only when a specific input pattern appears. We present a four-stage weight-space protocol that detects this threat before deployment by exploiting a structural property of backdoor fine-tuning: concentrating behaviour on a trigger–response mapping leaves a rank-1fingerprint in the weight delta ΔW.
Our central finding is the attacker's dilemma: the camouflage training that defeats the current state-of-the-art attention scanner simultaneously amplifies our geometric signal (σ₁/‖ΔW‖_F rises from 0.960 to 0.990, alarmed layers increase from 27 to 37). Combined detection is structurally harder to evade than either detector alone.
We validate across Gemma-3-1B and Qwen2.5-1.5B over five seeds each — alarm rate 10/10, p<0.000001. One stage requires no clean reference model.
I'm a fan of your thinking about the red team/blue team tradeoff in backdoor detection. I particularly like your attention to ways to bypass detection measures. I think that the biggest improvement to your work could have come from taking this a step further and training against your own detection metrics. Is this possible? How much does this degrade backdoor performance? These are questions that would have been good to know.
Separately, I think your paper could have explained the intuitions behind your work significantly more. Each test you proposed seemed reasonable on its own, but required additional work from me to understand. Further analysis and explanation of this would have been helpful.
Last, I think some of your setup is mistaken. Runtime monitors are effective mitigations against backdoored models, precisely because they monitor the model in deployment. If the backdoor triggers then the monitors are designed to catch it, and report it before harm is done. You are right that monitors might not be able to detect the backdoor in training, but preventing the backdoor activating secretly in deployment may be sufficient to prevent catastrophic outcomes.
This is really interesting and promising work. The main limitation is that AFAICT all of the results here rely on either knowing the delta between a known-safe base and a suspect fine-tune, or successfully guessing the trigger.
Cite this work
@misc {
title={
(HckPrj) Geometric Detection of Trigger-Activated Sleeper Agents: A Weight-Space Control Protocol
},
author={
Nikaran Kanchanadevi Marimuthu
},
date={
3/23/26
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


