Nov 25, 2024
Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering
Diego Sabajo, Eitan Sprejer, Matas Zabaljauregui, Oliver Morris
Summary
This paper addresses the risk of hallucinations in LLMs deployed in critical domains such as medicine. It proposes methods to (a) reduce the probability of hallucinated responses, (b) inform users of the hallucination risk and expected model accuracy for a given query, and (c) surface that risk through a user interface. Steered variants of Llama-3-8B-Instruct show fewer hallucinations and improved accuracy on medical queries. The work connects interpretability research to practical AI safety, offering a scalable approach for the healthcare industry. Future work will focus on identifying and removing distractor features from the classifier's activations to improve performance.
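The submission summary does not include the steering code itself, so the following is a minimal, illustrative sketch of how feature steering of this kind is typically implemented: adding a fixed direction to a decoder layer's residual stream via a PyTorch forward hook. The layer index, coefficient, and steering vector below are hypothetical placeholders, not values from this work; in the paper's setting the direction would come from an interpretability pipeline (e.g., a feature associated with hallucination-related behavior), not from random initialization as here.

# Minimal sketch of activation ("feature") steering on Llama-3-8B-Instruct.
# LAYER, COEFF, and steering_vector are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

LAYER = 16   # hypothetical residual-stream layer to steer
COEFF = 4.0  # hypothetical steering strength

# Placeholder direction; a real run would load a precomputed feature vector.
steering_vector = torch.randn(model.config.hidden_size)
steering_vector /= steering_vector.norm()

def steer(module, inputs, output):
    # Decoder layers may return a tuple (hidden_states, ...) or a bare
    # tensor depending on the transformers version; handle both.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steering_vector.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)

messages = [{"role": "user",
             "content": "What is the first-line treatment for type 2 diabetes?"}]
ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(ids, max_new_tokens=128)
print(tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))

handle.remove()  # detach the hook to recover the unsteered model

Because a hook that returns a value replaces the layer's output, the same mechanism can be attached or removed per request, which is what makes it practical to pair steered generation with a per-query hallucination-risk estimate in a user interface.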
Cite this work:
@misc{sabajo2024hallucination,
  title={Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q\&A Using Feature Steering},
  author={Sabajo, Diego and Sprejer, Eitan and Zabaljauregui, Matas and Morris, Oliver},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to a research sprint hosted by Apart.},
  howpublished={\url{https://apartresearch.com}}
}