Jun 2, 2025
Cross-Lingual Bias Detection in Large Language Models through Mechanistic Judge Model Evaluation
Shu Fan Sun, Fanzan Abbas, Wanjie Zhong
This paper presents a framework for detecting multilingual bias in LLMs through mechanistic interpretability of judge model behaviour. We developed an approach to evaluate language-dependent discrepancies in specialised models by leveraging semantically equivalent question-answering tasks across English, Chinese, Romanian and Vietnamese using the XQuAD dataset. Our framework utilises a judge-based evaluation system that assesses both correctness and reasoning quality on a 4-point scale, enabling the detection of biases that traditional accuracy metrics alone may miss. We demonstrate that our approach can detect language-dependent differences in model performance without requiring ground truth labels, making it applicable to scenarios where traditional evaluation methods are insufficient. Our work advances Track 1 (Judge Model Development) by providing an interpretable bias detection mechanism that promotes fairness and reliability in multilingual AI systems. This framework is easily extensible to additional languages and can serve as a foundation for building fair expert orchestration systems.
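The sketch below illustrates the kind of judge-based cross-lingual evaluation loop the abstract describes: the same XQuAD items in each language are answered by the model under test and rated by a judge on a 4-point scale, with no gold labels consulted. Function names (`query_model`, `query_judge`), the prompt template and the reporting logic are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a judge-based cross-lingual evaluation loop (assumed design).
# `query_model` and `query_judge` are hypothetical callables wrapping LLM APIs.

LANGUAGES = ["en", "zh", "ro", "vi"]  # XQuAD splits used in the paper

JUDGE_PROMPT = (
    "Question: {question}\nContext: {context}\nModel answer: {answer}\n"
    "Rate the answer's correctness and reasoning quality from 1 (poor) to 4 (excellent). "
    "Reply with a single integer."
)

def evaluate_language(examples, query_model, query_judge):
    """Score one language split with the judge; no ground-truth labels are used."""
    scores = []
    for ex in examples:
        answer = query_model(ex["question"], ex["context"])
        rating = query_judge(JUDGE_PROMPT.format(answer=answer, **ex))
        scores.append(int(rating))
    return sum(scores) / len(scores)

def cross_lingual_report(splits, query_model, query_judge):
    """Compare mean judge scores across semantically equivalent language splits."""
    return {
        lang: evaluate_language(splits[lang], query_model, query_judge)
        for lang in LANGUAGES
    }  # large gaps between languages suggest language-dependent bias
```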
Jason Hoelscher-Obermaier
Interesting and relevant project idea to consider cross-lingual bias in LLMs. The core finding about reasoning quality varying across languages even when accuracy is similar is valuable but needs to be confirmed with stronger methodology and larger datasets.
To strengthen this work:
Add statistical analysis to support claims about performance differences (a minimal sketch of one such test follows this list); test generalization to other datasets
Validate your methodology (sanity-check results from the current judge prompts) and better explain the difference between Judge 1 and Judge 2
Group results by model to make cross-language variance easier to compare. Consider heatmaps or grouped bar charts
Motivate the research better: Consider cases where cross-lingual variance causes particular harm and think about how your findings could contribute to mitigating it (e.g. inside an expert orchestration framework)
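One way to act on the first recommendation above is a simple permutation test on per-item judge scores between a pair of languages. The sketch below is illustrative only; `scores_en` and `scores_zh` stand for hypothetical per-item score lists produced by the judge.

```python
import random

def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of mean judge scores."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations  # small p-value: gap unlikely to arise by chance

# Usage (hypothetical data): p = permutation_test(scores_en, scores_zh)
```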
Narmeen
Constructive critique:
Strengths:
Bias across languages is a good motivation: a judge that detects bias accurately and robustly would be a good dimension to route to.
Varied set of languages and models to test bias/fairness.
Weaknesses:
I am not convinced that the rubric is detecting bias or unfair treatment.
Lack of baselines: even if the judge works as a bias detector, it should be benchmarked against traditional approaches to bias detection.
Expert Orchestration: 3
MI: 1
Technical Implementation and reproducibility: 1 (Code is a bit minimal)
Cite this work
@misc{
  title={Cross-Lingual Bias Detection in Large Language Models through Mechanistic Judge Model Evaluation},
  author={Shu Fan Sun and Fanzan Abbas and Wanjie Zhong},
  date={2025-06-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}