Nov 25, 2024
Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering
Diego Sabajo, Eitan Sprejer, Matas Zabaljauregui, Oliver Morris
This paper addresses the risk of hallucination in LLMs in critical domains such as medicine. It proposes methods to (a) reduce the probability of hallucination in responses, (b) inform users of hallucination risk and model accuracy for specific queries, and (c) display hallucination risk through a user interface. Steered model variants demonstrate reduced hallucination rates and improved accuracy on medical queries. The work bridges interpretability research and practical AI safety, offering a scalable solution for the healthcare industry. Future efforts will focus on identifying and removing distractor features in classifier activations to improve performance.
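As a rough illustration of the feature-steering mechanism the abstract refers to, the sketch below adds a scaled SAE feature direction to the residual stream of Llama-3-8B-Instruct via a forward hook during generation. The layer index, coefficient, and random `steer_dir` placeholder are assumptions made for illustration; in the actual work the direction would come from the hallucination-related features the authors identify, and the implementation details may differ.

```python
# Illustrative sketch of activation ("feature") steering on Llama-3-8B-Instruct.
# `steer_dir` stands in for a trained SAE decoder direction (e.g. sae.W_dec[feature_id]);
# LAYER and COEFF are placeholder choices, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

LAYER = 16    # residual-stream layer to intervene on (illustrative)
COEFF = -4.0  # negative coefficient steers *away* from the feature
steer_dir = torch.randn(model.config.hidden_size)  # placeholder for an SAE feature direction
steer_dir = steer_dir / steer_dir.norm()

def steering_hook(module, inputs, output):
    # Decoder layers may return a tensor or a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steer_dir.to(hidden.device, hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
prompt = "What is the recommended first-line treatment for hypertension?"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```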
Tom McGrath
These results are really nice. The combination of methods (training an interpretable classifier on hallucinations, interpreting it, and then using the resulting features to steer) is both elegant and effective. The results on hallucination rate are striking: I'm surprised it's possible to reduce it this much.
I wonder if it's possible to have two lines of defence: do any classifiers identify some of the hallucinations that occur even once steering has been applied? I also wonder if features are additive in reducing hallucination rate.
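To make the classifier-then-steer pipeline described in the review above concrete, here is a minimal sketch of the classifier step, assuming scikit-learn and synthetic placeholder data in place of real per-response SAE feature activations. A shallow decision tree is fit on labelled responses, and its printed splits indicate which features are most predictive of hallucination (and hence are candidates for steering).

```python
# Illustrative sketch of an interpretable hallucination classifier over SAE features.
# X and y are synthetic placeholders; in practice X would hold per-response SAE
# feature activations and y the hallucination labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.random((500, 64))                   # 500 responses x 64 SAE features (placeholder)
y = (X[:, 7] + X[:, 21] > 1.2).astype(int)  # synthetic labels: 1 = hallucinated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("held-out accuracy:", clf.score(X_te, y_te))
# The splits name the features that drive the prediction, i.e. candidates for steering.
print(export_text(clf, feature_names=[f"feat_{i}" for i in range(X.shape[1])]))
```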
Alana Xiang
The team tackles an important problem: hallucination in medical questions. They seem to find a mild improvement from steering against hallucination. Further analysis is likely needed to determine if this improvement is spurious.
With more time, I would like to see the authors develop better methods for detecting hallucinations, such as human or Claude review.
I am not wholly convinced that the results generalize beyond this dataset, and I would've liked to see this tested in the paper.
The writeup is detailed and clear.
Good work!
Tom McGrath
This is really interesting work on an important problem. Intuitively it's reasonable to expect that hallucinations can be detected from SAE features, but I've never seen this demonstrated before, nor steering to actually reduce hallucination rate. The results are clear, well presented and methodologically sound, and the learned decision tree makes sense.
Cite this work
@misc{sabajo2024improvingllama3,
  title={Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q\&A Using Feature Steering},
  author={Diego Sabajo and Eitan Sprejer and Matas Zabaljauregui and Oliver Morris},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}