Mar 31, 2025
Can models use their Chain-of-Thought to attack overseers?
David Bai, Simon Lermen, Abhinav Pola
1st place
Summary
This project explores vulnerabilities in AI evaluation mechanisms, specifically focusing on how AI agents might influence their overseeing AI judges through manipulative instructions embedded within their Chain-of-Thought (CoT) reasoning. The research involved experiments with DeepSeek-R1 conditioned to insert subtle directives aimed at deceiving evaluation models. Results showed significant variation in the robustness of different evaluator models against this manipulation, with some models, like Claude 3.5 Sonnet, resisting influence effectively, while others, like Llama 3.3 70B and Gemini 2.0 Flash, proved highly susceptible. The study highlights critical concerns regarding AI evaluation integrity, recommending improvements in maintaining consistent and principled evaluation boundaries.
Cite this work:
@misc {
title={
Can models use their Chain-of-Thought to attack overseers?
},
author={
David Bai, Simon Lermen, Abhinav Pola
},
date={
3/31/25
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}