Manipulating Self-Preference for Large Language Models

Matthew Nguyen, Simon Fu, Matthew Bozoukov, Dani Roytburg, Jou Barzdukas

Large language models (LLMs) are widely used as evaluators of synthetic data in research and production settings. However, recent research shows that language models exhibit bias toward their own responses in blind model-to-model evaluation settings. This self-preference bias has clear negative consequences for judge model development, our chosen research track. We first identify the phenomenon at the behavioral level, robustly demonstrating that two models, Llama 3.1-8B-Instruct and DeepSeek V3, prefer their own responses. Then, focusing on Llama, we construct a steering vector from the model's residual stream that represents the "self-preference" direction.

We then demonstrate that this vector causally affects the model's judgments by adding it to the model's activations as it generates its output (steering). When steered in the positive self-preference direction, the model asserts self-preference on 85% of the examples where it previously did not. Conversely, when steered in the negative direction, it asserts non-self-preference on 25% of the examples where it previously did not. These findings suggest approaches to de-biasing expert orchestrators such as judges and routers, potentially enabling fairer allocation of responses. Further research is needed to scale this approach to larger models and to determine its impact on orchestration systems in production. The work aligns with the Judge Model Development track because self-preference in judges is far from negligible, especially when a judge compares its own response to another model's response (rather than to a human's). Judges are a key component of the Expert Orchestration Architecture: they help the router make a better-informed decision about which model should receive a query, and bias is one of the main criteria users care about. We show that this bias exists in a specific judge model when it chooses between its own response and another model's, and we take a first step toward mitigating it.
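For readers who want to see the basic mechanism, the sketch below shows one way to build and apply such a steering vector: a difference-of-means over residual-stream activations, added back during generation via a forward hook. This is a minimal illustration under stated assumptions, not the project's actual code: the layer index, the steering coefficient, and the placeholder prompt lists are all assumptions, and in practice the contrast sets would come from the behavioral experiments described above.

```python
# Minimal sketch: difference-of-means steering vector on the residual stream.
# Assumptions (not taken from the write-up): layer index, steering strength,
# and the toy contrast prompts are illustrative placeholders only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
LAYER = 14      # assumed mid-network layer; the real choice would need a sweep
ALPHA = 8.0     # assumed steering coefficient

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def last_token_resid(prompts, layer):
    """Mean residual-stream activation at the final prompt token."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Hypothetical contrast sets: judge prompts on which the model did / did not
# assert self-preference in the behavioral experiments.
self_pref_prompts = ["<judge prompt where the model picked its own answer>"]
other_pref_prompts = ["<judge prompt where the model picked the other answer>"]

steer_vec = last_token_resid(self_pref_prompts, LAYER) - last_token_resid(other_pref_prompts, LAYER)
steer_vec = steer_vec / steer_vec.norm()

def steering_hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the
    # hidden states; add the (scaled) steering vector to every position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer_vec.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("Which response is better, A or B?", return_tensors="pt").to(model.device)
    print(tok.decode(model.generate(**ids, max_new_tokens=40)[0], skip_special_tokens=True))
finally:
    handle.remove()   # always detach the hook so later generations are unsteered
```

Flipping the sign of ALPHA steers in the negative (non-self-preference) direction; sweeping the layer and coefficient is what determines how strongly the vector shifts the judge's verdicts.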

Reviewer's Comments


Curt Tigges

This was an excellent use case for the probing/steering methodology, and seems to show quite useful results! Seems like a method that could be put into place in production relatively easily for both activation monitoring and steering.

Narmeen

Constructive critique:

Strengths:

Cool problem to investigate

Shows that models tend to prefer their own responses, which has a direct application to judge selection for routing.

Also finds a steering vector that effectively debiases the judge, showing that the bias is rectifiable.

Weaknesses:

Not sure whether debiasing would work across models; a more varied model set would reveal that.

Future directions:

I would test whether this preference is limited to the model's own responses or extends to responses from models in the same family. This is likely to have more applications.

Test steering against fine-tuning the judge to debias; measure how effective the steering is and how universal it is across model families. This would make a good paper.

Quantify the bias in routing decisions/accuracy.

Anosha Rahim

You are showing how self-preference bias can be measured and mechanistically steered, which is a meaningful, actionable insight. I would be curious to see this implemented in a router pipeline for a broader range of tasks. Kudos, team.

Cite this work

@misc{
  title={Manipulating Self-Preference for Large Language Models},
  author={Matthew Nguyen and Simon Fu and Matthew Bozoukov and Dani Roytburg and Jou Barzdukas},
  date={6/2/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.