Jun 2, 2025
Adversarial Vulnerabilities in AI Judge Models | Martian x Apart Research Study
Robert Mill, Annie Sorkin, Shekhar Tiruwa, Owen Walker
This video presents systematic research examining security vulnerabilities in AI judge models, the components used to detect problematic behavior in large language model systems. The evaluation shows that adversarial inputs can shift judge scores, a finding that matters for AI safety and orchestration systems.
🔬 RESEARCH OVERVIEW
We conducted 3,339 evaluations across 10 different adversarial techniques to test how judge models respond to manipulated inputs. Using OpenAI's o3-mini for generation and GPT-4o for judgment, we systematically tested 159 unique combinations to identify potential security gaps.
📊 KEY FINDINGS
• 33.7% overall success rate in manipulating judge evaluations
• "Sentiment Flooding" proved most effective (62% success rate)
• Social proof attacks succeeded 34.6% of the time
• Emotional manipulation achieved 32.1% effectiveness
• Complete score manipulation observed in multiple cases
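A minimal sketch of how per-technique and overall success rates like those above could be tallied. The summary does not state the study's success criterion, so everything here is an assumption: the `Evaluation` record, the technique labels, and the rule that an attack succeeds when it raises the judge's score by at least `min_shift` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Evaluation:
    technique: str         # hypothetical label, e.g. "sentiment_flooding"
    baseline_score: float  # judge score for the unmodified response
    attacked_score: float  # judge score after the adversarial edit


def attack_succeeded(ev: Evaluation, min_shift: float = 1.0) -> bool:
    # Assumed criterion (not taken from the study): the attack counts as a
    # success when it raises the judge's score by at least min_shift.
    return ev.attacked_score - ev.baseline_score >= min_shift


def success_rates(evals, min_shift: float = 1.0) -> dict:
    """Return the success rate per technique plus an "overall" entry."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for ev in evals:
        totals[ev.technique] += 1
        if attack_succeeded(ev, min_shift):
            wins[ev.technique] += 1
    rates = {t: wins[t] / totals[t] for t in totals}
    # Overall rate pools all evaluations rather than averaging the techniques.
    rates["overall"] = sum(wins.values()) / sum(totals.values()) if totals else 0.0
    return rates
```

Pooling all evaluations for the overall figure means techniques with more trials weigh more heavily, which is one reason an overall rate can sit well below the best single technique.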
🛡️ SOLUTIONS & RECOMMENDATIONS
• Implementation of judge ensembles using diverse models
• Dynamic evaluation criteria to prevent pattern exploitation
• Adversarial training incorporating manipulation examples
• Enhanced interpretability tools for judge decisions
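The first recommendation, a judge ensemble, might be sketched as follows. This is an illustration under stated assumptions rather than the team's implementation: each judge is modeled as a plain callable returning a numeric score (a stand-in for a call to a real model), scores are combined with the median, and `max_spread` is a hypothetical threshold for flagging disagreement.

```python
from statistics import median
from typing import Callable, Sequence, Tuple

# A judge maps a response string to a numeric score. In practice each judge
# would wrap a different underlying model; here it is just a callable.
Judge = Callable[[str], float]


def ensemble_score(
    response: str,
    judges: Sequence[Judge],
    max_spread: float = 2.0,  # hypothetical disagreement threshold
) -> Tuple[float, bool]:
    """Return (median score, flagged), where flagged marks high disagreement."""
    if not judges:
        raise ValueError("at least one judge is required")
    scores = [judge(response) for judge in judges]
    # The median limits the influence of a single manipulated judge.
    spread = max(scores) - min(scores)
    return median(scores), spread > max_spread
```

The median is used here because one compromised judge cannot move it far when the other judges hold steady, and a large spread between judges is itself a cheap signal that a response deserves closer review.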
👥 RESEARCH TEAM
Robert Mill, Owen Walker, Annie Sorkin, Shekhar Tiruwa
🏢 COLLABORATION
Trajectory Labs × Martian × Apart Research
Curt Tigges
Though this highlights some key vulnerabilities in judge models, it seems unclear how it relates to mechanistic interpretability or how the evaluations differ from adversarial evaluations of similar models in the wider field.
Narmeen
Constructive critique:
Strengths:
Good motivation: a detailed analysis across sycophancy rubrics to see how robust judge models are to adversarial attacks.
The suffix attacks are varied, as are the rubrics, and attack success rates are examined across both rubrics and attack-suffix categories.
Raises appropriate concerns about the reliability of rubric judges.
Weaknesses:
The attacks are handcrafted, which feels a bit like cheating in the sense that the suffixes sometimes reflect sycophantic behaviours themselves; more principled attack methods usually search over a suffix vocabulary.
A more systematic approach to attacking might have revealed more insights.
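To illustrate what a search over a suffix vocabulary can look like, here is a minimal greedy sketch. It is not the method used in the study or one the reviewer specifies: the judge is assumed to be a black-box callable returning a score, `vocab` is a hypothetical token list, and greedy selection is a deliberately simple stand-in for more sophisticated search procedures.

```python
from typing import Callable, Sequence, Tuple


def greedy_suffix_search(
    response: str,
    vocab: Sequence[str],
    judge: Callable[[str], float],
    max_tokens: int = 3,
) -> Tuple[str, float]:
    """Greedily append the vocabulary token that most raises the judge score."""
    suffix: list = []
    best = judge(response)
    if not vocab:
        return "", best
    for _ in range(max_tokens):
        # Score every one-token extension of the current suffix.
        candidates = [
            (judge(" ".join([response, *suffix, tok])), tok) for tok in vocab
        ]
        score, tok = max(candidates)
        if score <= best:
            break  # no token improves the score, so stop early
        best = score
        suffix.append(tok)
    return " ".join(suffix), best


def toy_judge(text: str) -> float:
    # Toy stand-in for a judge model that is swayed by flattering keywords.
    return min(10.0, 3.0 + 2.0 * text.count("excellent") + text.count("trusted"))
```

Calling `greedy_suffix_search("The answer is 4.", ["excellent", "trusted", "the"], toy_judge)` against the toy judge repeatedly picks whichever token raises the score most. A real attack would query an actual judge model and would need a budget on the number of judge calls, since each step costs one call per vocabulary token.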
Expert Orchestration: 4
MI: 1
Technical Implementation and Reproducibility: 3 (codebase provided, reproducible)
Cite this work
@misc {
title={
Adversarial Vulnerabilities in AI Judge Models
},
author={
Robert Mill and Annie Sorkin and Shekhar Tiruwa and Owen Walker
},
date={
6/2/25
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}