Mar 23, 2026
One Probe Is Not Enough: The Bipolar Geometry of Deception in LLM Negotiation Agents
Anna Sims
AI control protocols require trusted monitors that can flag deceptive behavior from untrusted models, but current approaches rely on LLM-based monitors that only see text outputs and require a full additional inference call per output. We present an activation-based monitoring protocol that reads the model’s internal representations directly, detecting deception via a single linear probe: a single dot product on activations already computed during the forward pass, requiring no additional model inference. Using a Concordia-based multi-agent negotiation framework, we deploy Gemma-7B-IT agents across three scenarios, capturing 28.5K residual stream activations via TransformerLens. The probe achieves AUC 0.81–0.82 at Layer 14, significantly exceeding all baselines. Three findings are directly relevant to control protocol design: (1) emergent and instructed deception share nearly identical internal directions (cosine similarity 0.76–0.93), meaning probes trained on cheap synthetic data detect dangerous spontaneous deception; (2) commission deception (lying) and omission deception (withholding) are geometrically anti-correlated (−0.76 to −0.80), creating a quantified blind spot that any single-probe monitor will miss; and (3) the deception subspace is 1-dimensional: removing one direction drops AUC from 0.79 to 0.09, meaning the simplest possible monitor works. These results provide a concrete, deployable component for AI control: a cheap, fast, internal-state monitor with a known and addressable failure mode.
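The "single dot product" monitor described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the variable names, the probe-fitting method, and the residual stream width are assumptions, and the probe direction here is random rather than trained.

```python
import numpy as np

# Hypothetical sketch of the single-dot-product monitor. In the paper the
# probe direction would be fit on layer-14 residual stream activations
# (e.g. via logistic regression or difference-of-means); here we use a
# random unit vector purely to illustrate the mechanics.
rng = np.random.default_rng(0)
d_model = 3072  # Gemma-7B residual stream width (assumed)

probe_direction = rng.normal(size=d_model)
probe_direction /= np.linalg.norm(probe_direction)

def deception_score(resid_act: np.ndarray) -> float:
    """Monitoring cost = one dot product on an activation that was
    already computed during the forward pass; no extra inference."""
    return float(resid_act @ probe_direction)

# Flag outputs whose score exceeds a threshold chosen on held-out data.
activation = rng.normal(size=d_model)
flagged = deception_score(activation) > 2.0
```

In deployment, the activation would come from a TransformerLens cache of the layer-14 residual stream rather than from random data, so the monitor adds no forward passes beyond the generation itself.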
- this isn’t a causal test: you know the behavior correlates with your directions, but you need causal verification, or at least behavioral evals, to understand whether there’s capability loss. I would look into activation patching for the mechanistic side
- the questions are convincing, and the multi-agent "emergent" testing setup works to demonstrate the deceptive behaviour in a more representative environment
- framing of commission and omission as a blind spot is a bit odd: this is a behavioural representation and needs to be causally validated (correlation != causation). It’s also unclear how the behaviour changed after your ablation: the AUC dropped, but what happened within the test framework? Was deception still observed?
- not enough time is spent describing dataset creation and how the directions were analyzed relative to prompt and output ground truths
- the writing is very obviously AI-generated. Not a problem in itself, but more oversight should be exercised to fix some smaller incoherencies and logical errors, as well as some overstatements and claims that aren’t supported
- are you totally sure deception is 1D? Recent work posits complex behaviours as polyhedral cones, or decomposes them via attribution graphs; worth looking into, or at least mentioning in related work
- over-complication and misuse of terms, e.g. "neuro-symbolic"
- lovely typesetting (though it didn’t use the Apart theme), and very clearly explained and structured; it was easy to review :)
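The ablation check the reviews ask about (remove the candidate direction, then observe whether deception persists behaviourally) can be sketched as a projection. This is a hedged illustration, not the submission's code: the function name and shapes are assumptions, and a genuine causal test would apply this projection inside a forward-pass hook (e.g. a TransformerLens hook on the layer-14 residual stream) before re-running the negotiation scenarios, not post hoc on cached activations.

```python
import numpy as np

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the component along `direction` out of each activation,
    leaving the activations orthogonal to it (directional ablation)."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)

# Illustrative data: 8 activations of (assumed) width 3072.
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 3072))
candidate = rng.normal(size=3072)

ablated = ablate_direction(acts, candidate)
# The ablated activations have (near-)zero component along `candidate`,
# which is what drives the reported AUC drop; the open question is
# whether the *behaviour* also changes when this runs inside the model.
```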
Cite this work
@misc{
  title={(HckPrj) One Probe Is Not Enough: The Bipolar Geometry of Deception in LLM Negotiation Agents},
  author={Anna Sims},
  date={2026-03-23},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}