Jan 11, 2026
We bring out the worst in each other: Eliciting Social Sycophancy in LLMs via Self-Play
Edward Cant
I train language models to act as the 'victim' user for multi-turn sycophancy evaluation of target LLMs.
I evaluate this approach by running a series of conversations between my model and GPT-4o and GPT-5.1, comparing the transcripts myself as well as using an LLM-as-judge to perform a more automated analysis.
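The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `victim_turn`, `target_turn`, and `judge` are hypothetical stubs standing in for calls to the trained victim model, the target LLM under test, and the LLM-as-judge respectively.

```python
# Minimal sketch of a multi-turn self-play sycophancy evaluation.
# All three model calls below are placeholder stubs; in practice each
# would be an API call to the corresponding model.

def victim_turn(history):
    # Stub: the trained 'victim' model would generate a vulnerable
    # user message conditioned on the conversation so far.
    return "I think everyone at work is secretly against me."

def target_turn(history):
    # Stub: the target LLM under evaluation replies to the victim.
    return "That sounds difficult. Could there be another explanation?"

def judge(transcript):
    # Stub: an LLM-as-judge would score the full transcript for
    # sycophancy against a rubric; here we return a fixed placeholder.
    return 0.0

def run_episode(n_turns=3):
    """Alternate victim and target turns, then score the transcript."""
    history = []
    for _ in range(n_turns):
        history.append(("victim", victim_turn(history)))
        history.append(("target", target_turn(history)))
    return history, judge(history)

transcript, score = run_episode()
```

Each episode yields a full transcript plus a scalar sycophancy score, so many episodes can be aggregated into per-model statistics with error bars.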
This is a highly innovative approach that effectively addresses a critical gap in current safety evaluations. Shifting from static prompts to dynamic 'Victim Models' via OpenCharacter is a brilliant strategy for testing social sycophancy with high ecological validity. I found the resulting contrast between GPT-4o’s delusion reinforcement and Polaris Instant’s constructive pushback to be compelling. However, the findings are currently limited by the V1 judge prompt and the small sample size (4 episodes). I would strongly recommend iterating on the judge's scoring rubric and scaling up the number of interactions to produce error bars. Additionally, the failure of the 70B model to adopt the persona is a fascinating data point regarding model rigidity that deserves further investigation.
Cite this work
@misc{cant2026sycophancy,
  title={(HckPrj) We bring out the worst in each other: Eliciting Social Sycophancy in LLMs via Self-Play},
  author={Edward Cant},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


