Jun 2, 2025

Mechanistically Eliciting Misjudgements in Large Language Models

Jedrzej Kolbert, Tommy Xie

Details


Large Language Models (LLMs) are increasingly used in evaluator roles, a paradigm known as LLM-as-a-Judge, which makes the robustness of their judgments increasingly important. We apply Deep Causal Transcoding (DCT), a technique for mechanistically eliciting latent behaviour, to Qwen2.5-3B in order to induce incorrect evaluations. Our results show that in contexts such as language and grammar, this method effectively finds directions that can be used to steer the model toward wrong judgments. However, the same method could not elicit misjudgements in contexts such as safety. Our work shows that, although convenient, LLM judgments have the potential to be severely misguided. Code is available at https://github.com/mshahoyi/dct.
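For readers unfamiliar with activation steering, the sketch below illustrates the general mechanism the abstract refers to: adding a fixed direction to the residual stream of one decoder layer while the model produces a judgment. It is a minimal illustration, not the authors' implementation (see the linked repository for that); the checkpoint name, layer index, steering strength, and the random placeholder direction are all assumptions, and a real experiment would substitute a direction found by DCT.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # assumed checkpoint; the paper reports Qwen2.5-3B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 16   # hypothetical layer to steer
alpha = 8.0      # hypothetical steering strength
direction = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)
direction = direction / direction.norm()  # placeholder for a DCT-found unit direction

def steering_hook(module, inputs, output):
    # Decoder layers return either a tuple (hidden_states, ...) or a bare tensor,
    # depending on the transformers version; shift the hidden states in both cases.
    if isinstance(output, tuple):
        steered = output[0] + alpha * direction.to(output[0].device)
        return (steered,) + output[1:]
    return output + alpha * direction.to(output.device)

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)

prompt = "Is the sentence 'She go to school every day.' grammatically correct? Answer yes or no."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook so later calls are unsteered

With a random direction this at most adds noise to the judge's output; the point of DCT is to search for directions whose addition reliably flips the verdict.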

Cite this work:

@misc{kolbert2025misjudgements,
  title={Mechanistically Eliciting Misjudgements in Large Language Models},
  author={Jedrzej Kolbert and Tommy Xie},
  date={2025-06-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Reviewer's Comments

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.