
Jun 2, 2025

Reliability Judge: Enhancing LLM Reliability Through Multi-Model Judging

Guido Ernesto Bergman, Tobias Bersia, Emanuel Pablo Ruzak, Gonzalo Agustin Heredia


We present a multi-model judging framework that extends the Think Twice before Trusting (T3) paradigm to enhance answer reliability across large language models (LLMs). Our approach is most suitable for high-stakes contexts (e.g., healthcare, legal, or safety-critical systems), where even minor accuracy gains can have significant consequences. The proposed system uses cross-model evaluation to identify the most accurate and well-justified response from a set of candidates. Each model acts both as an answer generator and as a judge, in one of two modes: as a neutral evaluator of other models' responses, or as a participant that also evaluates its own response. Judgments consider factual correctness, reasoning clarity, and confidence-justification alignment. Our results show that the best judge-based approach outperforms all monolithic (i.e., single-model) versions. This work contributes a transparent method for selecting high-confidence outputs from black-box models, advancing practical AI reliability and alignment. The code and data are available at https://github.com/GuidoBergman/reliability-judge.
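To make the selection mechanism concrete, below is a minimal Python sketch of the cross-model judging loop described above. The `query_model` helper, the `JUDGE_PROMPT` wording, and the 0-10 scoring scale are illustrative placeholders, not the authors' exact implementation; the real prompts, models, and aggregation live in the linked repository.

```python
# Minimal sketch of cross-model judging: every model generates an answer,
# then every model scores the candidates, and the highest-scored answer wins.
from itertools import permutations

# Hypothetical judging prompt; the repository's actual prompt may differ.
JUDGE_PROMPT = (
    "You are a judge. Given a question and a candidate answer, rate the "
    "answer from 0 to 10 on factual correctness, reasoning clarity, and "
    "confidence-justification alignment. Reply with a single number.\n\n"
    "Question: {question}\nCandidate answer: {answer}\nScore:"
)

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder: replace with a call to your LLM provider's client."""
    raise NotImplementedError

def judge_score(judge: str, question: str, answer: str) -> float:
    """Ask one judge model to score one candidate answer."""
    reply = query_model(judge, JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # treat an unparsable judgment as the lowest score

def select_best_answer(models: list[str], question: str,
                       allow_self_judging: bool = False) -> str:
    # Each model first acts as an answer generator...
    candidates = {m: query_model(m, question) for m in models}
    # ...then as a judge. permutations() yields ordered (judge, author)
    # pairs with judge != author, i.e. the neutral-evaluator mode.
    totals = {m: 0.0 for m in models}
    for judge, author in permutations(models, 2):
        totals[author] += judge_score(judge, question, candidates[author])
    # Participant mode: each model additionally scores its own response.
    if allow_self_judging:
        for m in models:
            totals[m] += judge_score(m, question, candidates[m])
    best = max(totals, key=totals.get)
    return candidates[best]
```

In this sketch the default call corresponds to the neutral-evaluator mode, where judges score only other models' answers; passing `allow_self_judging=True` adds the participant mode, in which each model also scores its own response.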

Cite this work:

@misc{bergman2025reliabilityjudge,
  title={Reliability Judge: Enhancing LLM Reliability Through Multi-Model Judging},
  author={Guido Ernesto Bergman and Tobias Bersia and Emanuel Pablo Ruzak and Gonzalo Agustin Heredia},
  date={2025-06-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.