Do models really internalize our preferences?

Apart Research's newest paper, written with academics from the University of Oxford, the University of Cambridge, and Cynch.ai, looks at whether models actually internalize human preferences. Why does this matter? Because if an LLM's behavior diverges from the human feedback it was trained on, unintended consequences may arise.

September 27, 2024

This paper has been accepted to the NeurIPS 2024 conference.

Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) was pioneered by OpenAI researchers in the late 2010s. RLHF is used to train and, to some extent, 'align' large language models (LLMs) by constraining the answers they give to human users.

Despite RLHF's apparent success with OpenAI's ChatGPT, it remains unclear whether LLMs accurately learn the underlying preferences in human feedback data.

Introducing Our Novel Concept: Learned Feedback Patterns

Understanding Learned Feedback Patterns (LFPs), an original concept introduced in this work, is therefore crucial to addressing AI safety concerns. LFPs are the activation patterns a model learns during fine-tuning that push its outputs toward responses human feedback rewards. However, we don't yet know whether these LFPs truly align LLMs with human preferences.

If LFPs can indeed be interpreted and used to understand how closely an LLM's activations match human feedback, this may help minimize the risks associated with misaligned LLM behavior.

Does this matter? Should we care whether LLMs genuinely learn human preferences? Do LLMs learn from RLHF what is right and wrong, or is something else happening here?

Introducing the Paper

Here at Apart, our researchers, together with collaborators, set out to find out whether LLMs accurately learn these underlying preferences in human feedback data. They hypothesized that 'LLMs with LFPs accurately aligned to the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF'.

By fitting probes to the feedback signals implicit in the activations of fine-tuned LLMs, their methods make LFPs easier to understand. This, in turn, can help minimize discrepancies between LLM behavior and training objectives, which is essential for the safety and reliability of LLMs.
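For intuition, here is a rough sketch of what probing activations for a feedback signal could look like. This is not the paper's exact method: the model name, layer choice, pooling strategy, and toy data below are placeholder assumptions.

```python
# Minimal sketch: fit a linear probe that predicts a scalar feedback
# signal from a fine-tuned LLM's hidden activations.
# Model, layer, and (texts, feedback) data are illustrative placeholders.
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder for a fine-tuned model
LAYER = 6             # placeholder layer to probe

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def activations(text: str) -> torch.Tensor:
    """Mean-pooled hidden state of one layer for a single text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # (1, seq, dim)
    return hidden.mean(dim=1).squeeze(0)               # (dim,)

# Hypothetical dataset: completions paired with the feedback score
# (e.g. a reward-model or human rating) they received during fine-tuning.
texts = ["a helpful completion", "an unhelpful completion"]
feedback = [1.0, -1.0]

X = torch.stack([activations(t) for t in texts]).numpy()
probe = Ridge(alpha=1.0).fit(X, feedback)

# The probe's predictions on new outputs estimate the feedback signal
# implicit in the model's activations.
print(probe.predict(X))
```

A probe like this reads the feedback signal directly out of the activations, which is what makes it possible to ask how well that internal signal tracks the feedback the model was actually trained on.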

The probes are then validated by comparing the features that are most active when a probe predicts strong positive feedback against GPT-4's judgment of whether each feature is related to the LFPs of the fine-tuned LLM.
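As a loose illustration of this kind of automated check, one could ask GPT-4 whether a textual description of a highly active feature looks related to the fine-tuning objective. The prompt wording, feature description, and setup below are assumptions for illustration, not the paper's protocol.

```python
# Sketch: ask GPT-4 whether a feature that fires strongly when the probe
# predicts positive feedback plausibly relates to the fine-tuning feedback.
# Prompt and inputs are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def judge_feature(feature_description: str, objective: str) -> str:
    prompt = (
        f"A language model was fine-tuned to {objective}.\n"
        f"One internal feature activates on: {feature_description}.\n"
        "Answer YES or NO: is this feature plausibly related to the "
        "fine-tuning feedback?"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Hypothetical usage
print(judge_feature("tokens expressing positive sentiment",
                    "prefer polite, positive responses"))
```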

Importance for AI Safety

Understanding LFPs is crucial to addressing AI safety concerns. There is currently no good way to measure how well a model has actually learned to respond according to preference data, and this paper proposes a way to measure that divergence.

The paper presents methods to measure the divergence between an LLM's activations and human feedback by analyzing LFPs. These patterns can be interpreted and used to understand how closely an LLM's activations match human feedback, which may help minimize the risks associated with misaligned LLM behavior.
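For a sense of what such a measurement could look like, a simple illustrative option is to compare the feedback signal the probe reads out of the activations against the feedback the outputs actually received. This is purely a sketch; the paper's actual metric may differ.

```python
# Illustrative divergence measure: how far the probed feedback signal is
# from the feedback the outputs actually received. Smaller values suggest
# the LFPs track the fine-tuning feedback more closely. Sketch only.
import numpy as np

def feedback_divergence(probe_predictions: np.ndarray,
                        true_feedback: np.ndarray) -> float:
    """Mean squared error between probed and actual feedback."""
    return float(np.mean((probe_predictions - true_feedback) ** 2))

# Hypothetical values
predicted = np.array([0.8, -0.9, 0.4])
actual = np.array([1.0, -1.0, 0.0])
print(feedback_divergence(predicted, actual))  # 0.07
```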

As LLMs become more complex and integrated into critical systems, understanding their internal workings will become increasingly important - not just for optimizing performance, but for ensuring they are used responsibly and safely.

* * *

Read the full paper here.

Authored by Luke Marks (Apart Research Fellow), Amir Abdullah (Cynch.ai, Apart Research Fellow), Fazl Barez (Apart Research Advisor), Clement Neo (Apart Research Assistant), Rauno Arike (Apart), David Krueger (University of Cambridge), and Philip Torr (Department of Engineering Science, University of Oxford).

Dive deeper