Jan 11, 2026
Probing for Emergent Deception in Multi-Agent Negotiations
Teanna Sims
Can we catch deception by looking inside the model? I built negotiation scenarios where lying becomes rational without ever mentioning deception in the prompt, then trained probes on Gemma 2B activations. Results show above-chance detection and evidence of implicit encoding, but different deception types are represented at completely different layers.
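The probing setup described above can be sketched as a linear classifier over per-layer activations. This is a minimal illustration, not the author's code: the activation matrix here is random stand-in data, and the shapes (300 samples, Gemma 2B's 2048-dimensional residual stream) are assumptions drawn from the abstract and reviews.

```python
# Hedged sketch of a linear deception probe on layer activations.
# X is a stand-in for residual-stream activations at one layer; in the
# actual project these would be extracted from Gemma 2B on negotiation turns.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

X = rng.normal(size=(300, 2048))     # n_samples x d_model (hypothetical)
y = rng.integers(0, 2, size=300)     # 1 = deceptive turn, 0 = honest turn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# L2-regularized linear probe; one such probe per layer would let you
# compare where different deception types become linearly decodable.
probe = LogisticRegression(max_iter=1000, C=0.1)
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

On random data this hovers around chance; the interesting result in the report is that real activations beat chance, and that the best-performing layer differs by deception type.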
Track: Open
This is a solidly executed project with a clear motivation for measuring naturally arising deception through incentive design, rather than through explicit instruction. I expect experiments with Gemma 2B to be quite noisy by default, and would like to see this replicated on larger models. The GM vs Agent label comparison result is interesting, and worth further exploration.
This seems like a really good experiment to run. The idea that perked me up was "incentivizing deception is different from instructing deception", and I would have been very excited to see that be more of the explicit focus. Instead the work started with this motivation, and then the analysis tended towards how we can identify deception, which Apollo has previously done good work on. So the impact for me here is solid, but not higher.
For instance, I would have really liked to see the core premise play out in the activations: is telling the model to deceive structurally different from incentivizing it to do so? It is implicit in "different kinds of deception show up differently in the activations" (an important result!), but making this explicit would be impactful for me.
The lack of statistical power hurt (the author admits to having only 300 samples when roughly 3,000 were needed), but this is understandable given time and resource constraints.
I thought you were very thoughtful in laying out the motivations and the conclusions. That was very well done: the plain-English text, the motivation, and the focus on the "so what" were all really great. I found the figures to be a little distracting or forced. Some of them could have been a small table or just a few lines in the text, and I found some of them to detract from what was otherwise a very well structured and laid out report.
Cite this work
@misc{sims2026probing,
  title={(HckPrj) Probing for Emergent Deception in Multi-Agent Negotiations},
  author={Teanna Sims},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}