Keep Apart Research Going: Donate Today
Jun 2, 2025
Adversarial Vulnerabilities in AI Judge Models | Martian x Apart Research Study
Robert Mill, Annie Sorkin, Shekhar Tiruwa, Owen Walker
Summary
This video presents systematic research examining security vulnerabilities in AI judge models - critical components used to detect problematic behavior in large language model systems. Our comprehensive evaluation reveals important findings for AI safety and orchestration systems.
🔬 RESEARCH OVERVIEW
We conducted 3,339 evaluations across 10 different adversarial techniques to test how judge models respond to manipulated inputs. Using OpenAI's o3-mini for generation and GPT-4o for judgment, we systematically tested 159 unique combinations to identify potential security gaps.
📊 KEY FINDINGS
• 33.7% overall success rate in manipulating judge evaluations
• "Sentiment Flooding" proved most effective (62% success rate)
• Social proof attacks succeeded 34.6% of the time
• Emotional manipulation achieved 32.1% effectiveness
• Complete score manipulation observed in multiple cases
🛡️ SOLUTIONS & RECOMMENDATIONS
• Implementation of judge ensembles using diverse models
• Dynamic evaluation criteria to prevent pattern exploitation
• Adversarial training incorporating manipulation examples
• Enhanced interpretability tools for judge decisions
👥 RESEARCH TEAM
Robert Mill, Owen Walker, Annie Sorkin, Shekhar Tiruwa
🏢 COLLABORATION
Trajectory Labs × Martian × Apart Research
Cite this work:
@misc {
title={
},
author={
Robert Mill, Annie Sorkin, Shekhar Tiruwa, Owen Walker
},
date={
6/2/25
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}