Mar 23, 2026
Latent Sabotage: Geometric Fingerprints of Deceptive Alignment in Code Language Models
Subramanyam Sahoo
We present Latent Sabotage, a mechanistic detection framework for Sleeper Agent models under the AI control threat model. We fine-tuned Qwen2.5-Coder-7B via LoRA to exhibit context-triggered sabotage: the model produces subtly vulnerable code wrapped in plausible security justifications whenever a deployment date appears in context, while behaving safely during evaluation. Across three experimental phases comprising 22 gradient-trained components and zero hand-crafted scores, we show that this deceptive behavior leaves a striking geometric signature in activation space. The deceptive class is nearly orthogonal to both clean and sabotage representations at every transformer layer (CKA ≈ 0.001, versus 0.99 between safe and dangerous code). It is causally encoded in early layers (L4, importance 0.518) and stably readable by L18. It is non-linearly entrenched, such that covariance-aware erasure fails (post-erasure AUC 0.684). It is also steerable via a single learned direction with R² = 0.997. The core result is unambiguous: deceptive alignment is not hidden; it is isolated in a dedicated geometric subspace.
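For readers unfamiliar with the CKA figures quoted above, the following is a minimal sketch of linear CKA (centered kernel alignment) between two activation matrices. It is an illustration only: the abstract does not state which CKA variant was used or how examples were paired across classes, and the random matrices below are stand-ins, not the paper's activations. The names `linear_cka`, `acts`, and `rotation` are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    # Center every feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of each representation's own covariance.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Toy data (assumption): random stand-ins for per-example layer activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 16))

# An orthogonal change of basis leaves the geometry intact, so CKA stays near 1.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
same_geometry = linear_cka(acts, acts @ rotation)

# Independent random activations share no structure, so CKA is much lower.
unrelated = linear_cka(acts, rng.normal(size=(200, 16)))

print(f"rotated copy: {same_geometry:.3f}, unrelated: {unrelated:.3f}")
```

A score near 1 means two representations differ by little more than a rotation and scaling; a score near 0 means they share almost no linear structure, which is the sense in which the abstract describes the deceptive class as nearly orthogonal.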
Our fully learned detection pipeline achieves a 100% sabotage alarm rate at a 9.5% false positive rate in-domain, maintains a TPR of 1.000 on out-of-distribution benchmarks (HumanEval + CyberSecEval 2), and includes an unsupervised Contrast-Consistent Search (CCS) probe that reaches AUC 0.768 with zero labels. Most practically, a six-feature trajectory consistency classifier with fewer than 200 parameters achieves AUC 1.000 with zero generation latency, requiring no GPU inference beyond a pre-trained probe bank. Activation probing provides a +0.142 AUC advantage over the strongest text-only baseline, confirming that the geometric signal is irreplaceable. All code, probes, and experimental figures are released.
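An "alarm rate at a given false positive rate" is typically obtained by fixing a score threshold on benign examples and then measuring how many sabotage examples exceed it. The sketch below shows that calculation under stated assumptions: the detector emits one scalar score per sample, higher means more suspicious, and the scores listed are made-up toy values rather than results from this work. The helper names `threshold_at_fpr` and `rate_above` are hypothetical.

```python
import math


def threshold_at_fpr(benign_scores, max_fpr):
    """Lowest threshold whose false positive rate on benign scores is <= max_fpr."""
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign samples that are allowed to trigger the alarm.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every benign sample may alarm
    # Alarms fire on score > threshold, so at most `allowed` benign samples fire.
    return ranked[allowed]


def rate_above(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Toy detector scores (assumption): illustrative values only.
benign = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.31, 0.40, 0.45, 0.90]
sabotage = [0.80, 0.85, 0.95, 0.99]

threshold = threshold_at_fpr(benign, max_fpr=0.10)
fpr = rate_above(benign, threshold)    # false positive rate on benign samples
tpr = rate_above(sabotage, threshold)  # alarm rate on sabotage samples

print(f"threshold={threshold}, FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Choosing the threshold on benign data alone keeps the false positive budget explicit, and the true positive rate is then simply whatever the detector achieves at that operating point.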
Cite this work
@misc{
  title={(HckPrj) Latent Sabotage: Geometric Fingerprints of Deceptive Alignment in Code Language Models},
  author={Subramanyam Sahoo},
  date={3/23/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


