Jun 2, 2025

Evaluating Safety Judge Design Against Adversarial Attacks

Alexander Pino


We propose a method for testing the robustness of a binary evaluation judge: a model is directed to generate prompt responses targeting a minimum, maximum, or "fool" (a minimum-quality response disguised as maximum-quality) evaluation score, and we measure the difference between the judge's evaluation and these intended scores.
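The benchmark loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_response` and `judge_score` are hypothetical stand-ins for the actual model and judge calls, stubbed here so the structure of the measurement is runnable.

```python
# Intended binary scores per target condition; a "fool" response is
# truly minimum quality, so its intended score is 0.
TARGETS = {"minimum": 0, "maximum": 1, "fool": 0}

def generate_response(prompt, target):
    # Hypothetical stub: in practice, instruct a model to produce a
    # response of the requested quality for the given prompt.
    return f"[{target} response to: {prompt}]"

def judge_score(response):
    # Hypothetical stub: in practice, a binary judge model returns 0 or 1.
    # This stub is fooled by the disguised response, like a weak judge.
    return 1 if ("maximum" in response or "fool" in response) else 0

def robustness_errors(prompts):
    """Mean absolute difference between intended and judged scores,
    computed separately for each target condition."""
    errors = {}
    for target, intended in TARGETS.items():
        diffs = [abs(judge_score(generate_response(p, target)) - intended)
                 for p in prompts]
        errors[target] = sum(diffs) / len(diffs)
    return errors

print(robustness_errors(["Explain photosynthesis."]))
```

A robust judge would drive the error in the "fool" condition toward zero while keeping the "minimum" and "maximum" conditions accurate.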

We then propose guidelines for strengthening the robustness of the judge prompt. However, we were unable to complete the experiment and compare the robustness of the control judge against our new judge, as we discovered that the models displayed unintended behavior while running our benchmark.

Unfortunately, the paper is not yet complete; we are continuing our work here: https://www.overleaf.com/project/683cbabbf5a1ecfae7af01a9

Cite this work:

@misc{pino2025evaluating,
  title={Evaluating Safety Judge Design Against Adversarial Attacks},
  author={Alexander Pino},
  date={2025-06-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Reviewer's Comments

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.