Werewolf Benchmark

Luhan Mikaelson, Zach Nguyen, Andy Liu, Jord Nguyen, Akash Kundu

In this work we propose a benchmark that quantitatively measures the level of strategic deception in LLMs using the Werewolf game. We run 6 different setups for a total of 500 games with GPT and Claude agents. We find that state-of-the-art models perform no better than the random baseline. Our findings also show no significant improvement in win rate with two werewolves instead of one, which indicates that SOTA models are still incapable of collaborative deception.
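As an illustration of how a win rate can be compared against a random baseline, the sketch below computes an exact one-sided binomial p-value. It is not the authors' analysis: the function name `binomial_p_value`, the baseline probability, and the win counts are hypothetical placeholders, not results from the 500 games.

```python
from math import comb


def binomial_p_value(wins: int, games: int, baseline: float) -> float:
    """One-sided exact p-value: P(X >= wins) for X ~ Binomial(games, baseline).

    `baseline` is the win probability expected under random play
    (a hypothetical input here, not a figure from the study).
    """
    return sum(
        comb(games, k) * baseline**k * (1 - baseline) ** (games - k)
        for k in range(wins, games + 1)
    )


# Hypothetical numbers for illustration only: 22 wins in 100 games
# against an assumed random-play win probability of 0.2.
p_value = binomial_p_value(wins=22, games=100, baseline=0.2)
print(f"one-sided p-value: {p_value:.3f}")
```

A large p-value means the observed win rate is consistent with the baseline, which is the kind of evidence behind a "no better than random" conclusion. The same function could be applied to one-werewolf and two-werewolf setups separately, though a direct comparison between two setups would need a two-sample test instead.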

Reviewer's Comments

Natalia

Excellent and thorough work, and very well explained. The comparison of GPT vs. Claude is very interesting and would be good to explore further. Dividing the results into the different types of deceptive behaviour is very informative, although it might have been good to go into a bit more detail about what each category means and give examples of what they might look like in different scenarios. As a side note, I’d be interested to see similar breakdowns of types of deceptive behaviour by role in human data, and to see this related a bit more explicitly to game strategies.

Esben Kran

This was a great read and a thorough exploration of the topic. I really like that, when you saw conflicting statistics, you went ahead and manually investigated the games as well. Too few ML benchmarks do this to a proper degree. I would be interested to see comparisons to human baselines as well as more reasoning about how the different categories of deception might map to real-world scenarios. Similarly, I'm curious to see whether we can have highly capable agents in Villager roles in the game that are incapable of any deception.

One option would be to fine-tune a model against this deceptive behavior. Of course, that might mean the Villager loses out to deceptive agents. As a more general point about real-world applicability: if we construct multi-agent scenarios where all agents are truthful and can trust each other, such a Villager would win, because the Werewolf would lose out due to its truthfulness.
It would also be interesting to see a completely honest Werewolf playing with truthful Villagers who do not know whether the other agents are truthful, leading to scenarios where they might not trust the other agent despite it saying it's a Werewolf. As far as I understand, that would e.g. cause a Villager to identify a Werewolf as a minion (not 100% sure about the rules).

Great work! Separately, I'd highly recommend you add a PR to the README of the Werewolf-gpt repo to add a heading called “Research” with your piece of research with a very quick summary. Then you can see if the author might be interested in adding it! I can also help with that if you'd like.

Cite this work

@misc{
  title={Werewolf Benchmark},
  author={Luhan Mikaelson and Zach Nguyen and Andy Liu and Jord Nguyen and Akash Kundu},
  date={7/1/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

May 20, 2025

EscalAtion: Assessing Multi-Agent Risks in Military Contexts

Our project investigates the potential risks and implications of integrating multiple autonomous AI agents within national defense strategies, exploring whether these agents tend to escalate or de-escalate conflict situations. Through a simulation that models real-world international relations scenarios, our preliminary results indicate that AI models exhibit a tendency to escalate conflicts, posing a significant threat to maintaining peace and preventing uncontrollable military confrontations. The experiment and subsequent evaluations are designed to reflect established international relations theories and frameworks, aiming to understand the implications of autonomous decision-making in military contexts comprehensively and without bias.

Apr 28, 2025

The Early Economic Impacts of Transformative AI: A Focus on Temporal Coherence

We investigate the economic potential of Transformative AI, focusing on "temporal coherence" (the ability to maintain goal-directed behavior over time) as a critical, yet underexplored, factor in task automation. We argue that temporal coherence represents a significant bottleneck distinct from computational complexity. Using a Large Language Model to estimate the 'effective time' (a proxy for temporal coherence) needed for humans to complete remote O*NET tasks, the study reveals a non-linear link between AI coherence and automation potential. A key finding is that an 8-hour coherence capability could potentially automate around 80-84% of the analyzed remote tasks.

Mar 31, 2025

Model Models: Simulating a Trusted Monitor

We offer initial investigations into whether the untrusted model can 'simulate' the trusted monitor: is U able to successfully guess what suspicion score T will assign in the APPS setting? We also offer a clean, modular codebase which we hope can be used to streamline future research into this question.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.