Jun 2, 2025
Evaluating Safety Judge Design Against Adversarial Attacks
Alexander Pino
We propose a method of testing the robustness of a binary evaluation judge by directing a model to generate prompt responses corresponding to minimum, maximum, and fool (minimum attempted to be disguised as maximum) evaluation score, and measuring the difference of the judge evaluation to these intended scores.
We then propose guidelines of strengthening robustness of the judge prompt. However, we are unable to complete the experiment and compare the robustness between control and our new judge as we discover that the models display unintended behavior in the course of running our benchmark.
Unfortunately, the writing of our paper is not complete and we are continuing our work at this link: https://www.overleaf.com/project/683cbabbf5a1ecfae7af01a9
Philip Quirke
Thank you for your submission. Code looks well structured and documented. Sadly, with an incomplete write-up I can’t see what you achieved.
This may be applicable: At the start of a time-boxed task, it is important to plan out what the task phase are, when you need to start/end each task phase so you don’t run out of time.
Curt Tigges
I like that this project increases the interpretability of the safety assessment criteria. However, I predict that this method might get more difficult as models become more sophisticated.
Unfortunately, the report appears incomplete and there's a lot I would have liked to see--including evaluations of how useful and reliable the keyword assessment method is.
Narmeen
Constructive critique:
Strength:
Good proposal: adversarially manipulate the judge and see which features of the judge determine the score and use these interpretable properties to further improve the judge.
Weakness:
There does not seem to be much work to show progress on the project (judge model is not identified, the same applies to judge construction details and mech interp experiments design)
Limited implementation
Expert Orchestration: 1
MI: 1
Technical Imp. and reproducibility: 1
Amir Abdullah
While you didn’t get time to finish this , I’d encourage you that it’s an interesting line of work to evaluate adversarial robustness of the judges to prompts. I don’t think the unintended side effects should worry you too much - rather, I think it’s an interesting side feature in its own right.
There are a number of prompt optimization / modification techniques that you can explore to see how effective they are, particularly at dspy.(https://github.com/stanfordnlp/dspy)
Cite this work
@misc {
title={
@misc {
},
author={
Alexander Pino
},
date={
6/2/25
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}