Nov 25, 2024

Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering

Diego Sabajo, Eitan Sprejer, Matas Zabaljauregui, Oliver Morris

This paper addresses the risks of hallucinations in LLMs within critical domains like medicine. It proposes methods to (a) reduce hallucination probability in responses, (b) inform users of hallucination risks and model accuracy for specific queries, and (c) display hallucination risk through a user interface. Steered model variants demonstrate reduced hallucinations and improved accuracy on medical queries. The work bridges interpretability research with practical AI safety, offering a scalable solution for the healthcare industry. Future efforts will focus on identifying and removing distractor features in classifier activations to enhance performance.
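As a rough illustration of what steering along a feature and flagging risk from a feature activation can look like, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the vectors, the threshold, and the helper names (`feature_activation`, `steer`, `flag_risk`) are all hypothetical, and a real setup would operate on model activations and learned SAE features rather than hand-written lists.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so steering strength is in activation units."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def feature_activation(hidden: Vector, direction: Vector) -> float:
    """Project a hidden state onto a unit-norm feature direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Add a scaled feature direction to the hidden state.

    A negative strength pushes the state away from the feature.
    """
    return [h + strength * d for h, d in zip(hidden, direction)]


def flag_risk(hidden: Vector, direction: Vector, threshold: float) -> bool:
    """One-split classifier: flag the query when the feature fires above a threshold."""
    return feature_activation(hidden, direction) > threshold


# Hypothetical feature direction assumed to correlate with hallucination.
direction = normalize([3.0, 4.0, 0.0])
# Hypothetical hidden state for one query.
hidden = [2.0, 1.0, 0.5]

before = feature_activation(hidden, direction)
steered = steer(hidden, direction, strength=-1.5)
after = feature_activation(steered, direction)

print("activation before steering:", round(before, 3))
print("activation after steering: ", round(after, 3))
print("flagged before:", flag_risk(hidden, direction, threshold=1.0))
print("flagged after: ", flag_risk(steered, direction, threshold=1.0))
```

Because the direction is unit-norm, steering with strength `s` shifts the feature activation by exactly `s`, which makes the effect of a chosen strength easy to reason about in this toy setting.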

Reviewer's Comments


These results are really nice - the combination of methods (training an interpretable classifier on hallucinations, interpreting it, and then using the resulting features to steer) is both elegant and effective. The results on hallucination rate are striking: I'm surprised it's possible to reduce it this much.

I wonder if it's possible to have two lines of defence: do any classifiers identify some of the hallucinations that occur even once steering has been applied? I also wonder if features are additive in reducing hallucination rate.

The team tackles an important problem: hallucination in medical questions. They seem to find a mild improvement from steering against hallucination. Further analysis is likely needed to determine if this improvement is spurious.

With more time, I would like to see the authors develop better methods for detecting hallucinations, such as human or Claude review.

I am not wholly convinced that the results generalize beyond this dataset, and I would've liked to see this tested in the paper.

The writeup is detailed and clear.

Good work!

This is really interesting work on an important problem. Intuitively it's reasonable to expect that hallucinations can be detected from SAE features, but I've never seen this demonstrated before, nor steering to actually reduce hallucination rate. The results are clear, well presented and methodologically sound, and the learned decision tree makes sense.

Cite this work

@misc{
  title={Improving Llama-3-8B-Instruct Hallucination Robustness in Medical Q&A Using Feature Steering},
  author={Diego Sabajo, Eitan Sprejer, Matas Zabaljauregui, Oliver Morris},
  date={11/25/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed a 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even the frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.


Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.


Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users, not once recommending the cheapest product when explicitly asked, and about one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).

Our key finding was that model selection matters significantly in both settings. Some models complied with manipulative requests at much higher rates than others, and some readily followed operator instructions that came at the user's expense, highlighting a tension for model developers between serving paying operators and protecting end users.


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.