AI Security Evaluation Hackathon: Measuring AI Capability
May 27, 2024
Detecting Anthropomorphic Tendencies in Language Models via Conversational Probing

How often do models respond to prompts in anthropomorphic ways? AnthroBench-75 will tell you! It is a small benchmark designed to elicit anthropomorphic behavior from language models, with the aim of identifying when models behave anthropomorphically. Such behavior may pose risks relating to manipulation, AI consciousness, misspecification, and more.

Jacob Haimes, Esben Kran
