Jan 11, 2026

Probing for Emergent Deception in Multi-Agent Negotiations

Teanna Sims

Can we catch deception by looking inside the model? I built negotiation scenarios where lying becomes rational without ever mentioning deception in the prompt, then trained probes on Gemma 2B activations. Results show above chance detection and evidence for implicit encoding, but different deception types are represented at completely different layers.

Track: Open

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow
Arrow
Arrow

This is a solidly executed project with a clear motivation for measuring naturally arising deception through incentive design, rather than through explicit instruction. I expect experiments with Gemma 2B to be quite noisy by default, and would like to see this replicated on larger models. The GM vs Agent label comparison result is interesting, and worth further exploration.

This seems like a really good experiment to run. To me the idea that perked me up was "incentivizing deception is different than instructing deception", and I would have been very excited to have that be more of the explicit focus. Instead the work started off with this motivation, and then it seems like the analysis tended towards how we can identify deception - which Apollo has previously done good work on. So the impact for me here is solid, but not higher.

So for instance I would have really liked to see the core premise play out in the activations - is telling the model to do deception structurally different from incentivizing it to do so? It is implicit in "different kinds of deception show up differently in the activations" (! important result), but making this explicit would be impactful for me.

The lack of statistical power hurt (the admitted only have 300 samples and needing 3000) but is also understandable considering time and resource constraints.

I thought you were very thoughtful in laying out the motivations and the conclusions. That was very well done. As in - the plain English text, motivating, focus on the "so what" were all really great. I found the figures to be a little distracting or forced. Some of those could have been a small table or just a few lines in the text, and I found some of them to detract from what was otherwise a very well structured and laid out report.

Cite this work

@misc {

title={

(HckPrj) Probing for Emergent Deception in Multi-Agent Negotiations

},

author={

Teanna Sims

},

date={

1/11/26

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.

Read More

Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.

Read More

Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users - not once recommending the cheapest product when explicitly asked - and ~one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).

Our key finding was that model selection matters significantly in both settings. We learned some models complied with manipulative requests at much higher rates. And we found some models readily follow operator instructions that come at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.