Jun 2, 2025
Manipulating Self-Preference for Large Language Models
Matthew Nguyen, Simon Fu, Matthew Bozoukov, Dani Roytburg, Jou Barzdukas
Large language models (LLMs) are valuable as evaluators of synthetic data in research and production settings. However, recent research shows that language models exhibit bias towards their own responses in blind model-to-model evaluation settings. Self-preference bias has clear negative effects on effective judge model development, our chosen research track. We first identify instances of this phenomenon at the behavioral level, robustly demonstrating that two models, Llama 3.1-8B-Instruct and DeepSeek V3, exhibit this behavior. Then, focusing on Llama, we construct a steering vector from the residual stream of the model that represents the "self-preference" direction. We then demonstrate that this vector causally impacts the model's tendency to assert self-preference by adding it to the model's residual stream during generation (steering). When steered in the positive self-preference direction, the model asserts self-preference on 85% of the examples where it previously did not. Conversely, when steered in the negative direction, the model asserts non-self-preference on 25% of the examples where it previously asserted self-preference. These findings suggest potential approaches to debiasing expert orchestrators such as judges and routers, potentially enabling fairer allocation of responses. Further research is needed to scale this approach to larger models and to determine its impact on orchestration systems in production.

This work aligns with the Judge Model Development track because the prevalence of self-preference in judge models is not negligible, especially when a judge compares its own responses to those of other models rather than to humans. Judges are a key component of the Expert Orchestration Architecture: they help the router make a better-informed decision about which model to direct a query to. Bias is one of the main criteria users care about; we show that this bias exists in a specific judge model when it chooses between its own response and another model's, and we demonstrate a way to mitigate it.
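For concreteness, the steering procedure described above can be sketched as follows. This is a minimal illustration, assuming a difference-of-means vector built from Llama 3.1-8B-Instruct residual-stream activations; the layer index, steering coefficient, and helper names are hypothetical and not the exact recipe used in this work.

# Minimal sketch of residual-stream steering for self-preference (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
LAYER = 14   # hypothetical layer whose residual stream is read and steered
COEFF = 8.0  # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def build_self_preference_vector(self_pref_prompts, non_self_pref_prompts):
    # Difference of mean last-token residual-stream activations at LAYER between
    # judgments that assert self-preference and judgments that do not.
    def mean_activation(prompts):
        acts = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model(**ids, output_hidden_states=True)
            acts.append(out.hidden_states[LAYER][0, -1])
        return torch.stack(acts).mean(dim=0)
    return mean_activation(self_pref_prompts) - mean_activation(non_self_pref_prompts)

def steer(vector, coeff):
    # Register a hook that adds coeff * vector to the layer's output hidden states
    # on every forward pass, i.e. at each generated token.
    def hook(_module, _inputs, output):
        hidden = output[0] + coeff * vector.to(output[0].dtype)
        return (hidden,) + output[1:]
    return model.model.layers[LAYER].register_forward_hook(hook)

# Usage: a positive coefficient pushes toward self-preference, a negative one away from it.
# handle = steer(vector, +COEFF); ...generate and evaluate...; handle.remove()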
Curt Tigges
This was an excellent use case for the probing/steering methodology, and seems to show quite useful results! Seems like a method that could be put into place in production relatively easily for both activation monitoring and steering.
Narmeen
Constructive critique:
Strengths:
Cool problem to investigate
Shows that models tend to prefer their own responses, which has direct application to judge selection for routing.
Also finds a steering vector that effectively debiases the judge, showing that the bias is rectifiable.
Weaknesses:
Not sure whether debiasing would work across models; a more varied model set would reveal that.
Future directions:
I would test whether this preference is only towards the model's own responses or whether it extends to responses of models in the same family. This is likely to have more applications.
Test steering against fine-tuning the judge to debias it; measure how effective the steering is and how universal it is across model families. This would make a good paper.
Quantify the bias in routing decisions/accuracy.
Anosha Rahim
You are showing how self-preference bias can be measured and mechanistically steered, which is a meaningful, actionable insight. I would be curious to see this implemented in a router pipeline for a broader range of tasks. Kudos, team.
Cite this work
@misc{nguyen2025selfpreference,
  title={Manipulating Self-Preference for Large Language Models},
  author={Matthew Nguyen and Simon Fu and Matthew Bozoukov and Dani Roytburg and Jou Barzdukas},
  date={2025-06-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}