Jun 2, 2025
Reliability Judge: Enhancing LLM Reliability Through Multi-Model Judging
Guido Ernesto Bergman, Tobias Bersia, Emanuel Pablo Ruzak, Gonzalo Agustin Heredia
Summary
We present a multi-model judging framework that extends the Think Twice before Trusting (T3) paradigm to enhance answer reliability across large language models (LLMs). Our approach is most suitable for high-stakes contexts (e.g., healthcare, legal, or safety-critical systems), where even minor accuracy gains can have significant consequences. The proposed system uses cross-model evaluation to identify the most accurate and well-justified response from a set of candidates. Each model acts both as an answer generator and as a judge under two modes: as a neutral evaluator or as a participant evaluating its own response. Judgments consider factual correctness, reasoning clarity, and confidence-justification alignment. Our results show that the best judge-based approach outperforms every monolithic (i.e., single-model) baseline. This work contributes a transparent method for selecting high-confidence outputs from black-box models, advancing practical AI reliability and alignment. The code and data are available at https://github.com/GuidoBergman/reliability-judge.
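The selection loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`select_most_reliable`), the pluggable `generate`/`judge` callables, and the plain mean used to aggregate judge scores are all assumptions; the actual system evaluates factual correctness, reasoning clarity, and confidence-justification alignment via LLM judges rather than a numeric callback.

```python
# Hypothetical sketch of the cross-model judging loop.
# Assumptions: each model is a callable question -> answer, and `judge`
# returns a numeric reliability score; the real system uses LLM judges.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    model: str
    answer: str


def select_most_reliable(
    question: str,
    models: dict[str, Callable[[str], str]],       # model name -> answer generator
    judge: Callable[[str, str, str, str], float],  # (judge, question, answer, author) -> score
    allow_self_judging: bool = False,              # "participant" vs. "neutral" judging mode
) -> Candidate:
    """Every model answers; every permitted model judges; highest mean score wins."""
    candidates = [Candidate(name, generate(question)) for name, generate in models.items()]
    best, best_score = candidates[0], float("-inf")
    for cand in candidates:
        # In neutral mode a model never scores its own answer.
        judges = [m for m in models if allow_self_judging or m != cand.model]
        mean_score = sum(judge(j, question, cand.answer, cand.model) for j in judges) / len(judges)
        if mean_score > best_score:
            best, best_score = cand, mean_score
    return best
```

Toggling `allow_self_judging` switches between the two judging modes in the summary: with it off, each answer is scored only by the other models; with it on, a model also participates in evaluating its own response.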
Cite this work:
@misc{bergman2025reliabilityjudge,
  title={Reliability Judge: Enhancing LLM Reliability Through Multi-Model Judging},
  author={Guido Ernesto Bergman and Tobias Bersia and Emanuel Pablo Ruzak and Gonzalo Agustin Heredia},
  date={2025-06-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}