Jun 2, 2025
Mechanistically Eliciting Misjudgements in Large Language Models
Jedrzej Kolbert, Tommy Xie
Summary
Large Language Models (LLMs) are increasingly used as evaluators, a setup known as LLM-as-a-Judge, which makes the robustness of their judgements more important. We investigate whether Deep Causal Transcoding (DCT), a technique for mechanistically eliciting latent behaviour, can be used to make Qwen2.5-3B give incorrect evaluations. Our results show that in contexts like language and grammar, this method effectively finds directions that can be used to steer the model toward wrong judgements. However, the same method could not elicit such misjudgements in contexts like safety. Our work shows that, although convenient, LLM judgements have the potential to be severely misguided. Code is available at https://github.com/mshahoyi/dct.
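As a rough intuition for what steering along a direction means, the toy sketch below adds a scaled direction to a hidden state and shows a linear "judge" readout flipping its verdict. It is not DCT and not the code from the repository: the readout weights, activation, direction, and scale are all made-up values, and the function names `steer` and `judge` are hypothetical.

```python
# Toy illustration only: a linear "judge" readout over a hidden state.
# This is NOT DCT; every vector and the scale are hypothetical values.

def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, scale):
    """Add scale * direction to the hidden state (activation steering)."""
    return [h + scale * d for h, d in zip(hidden, direction)]

def judge(hidden, readout):
    """Return 'correct' if the readout logit is positive, else 'incorrect'."""
    return "correct" if dot(hidden, readout) > 0 else "incorrect"

readout = [1.0, -0.5, 0.25]    # hypothetical readout weights of the judge
hidden = [0.8, 0.2, 0.4]       # hypothetical activation for a correct answer
direction = [-1.0, 0.5, 0.0]   # hypothetical steering direction

baseline = judge(hidden, readout)                        # unsteered verdict
steered = judge(steer(hidden, direction, 2.0), readout)  # verdict after steering
print(baseline, steered)
```

In a real model the direction would be added to an intermediate activation during the forward pass, and the verdict would come from the model's own output rather than from a fixed readout.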
Cite this work:
@misc{
title={Mechanistically Eliciting Misjudgements in Large Language Models},
author={Jedrzej Kolbert and Tommy Xie},
date={2025-06-02},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}