Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders
Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) internalize an Implicit Reward Model (IRM). We use sparse autoencoders to interpret these IRMs, compare them to the explicit reward model used during RLHF, and validate our findings. Understanding IRMs matters for verifying that an LLM's learned behavior actually matches its training objective.
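To make the sparse-autoencoder component concrete, the following is a minimal NumPy sketch of the standard setup: a one-layer autoencoder with a ReLU latent code and an L1 sparsity penalty, trained by gradient descent on synthetic "activations" built from a few sparse ground-truth directions. The dimensions, hyperparameters, and synthetic data are illustrative assumptions, not the actual models or training configuration used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for model activations: sparse mixtures of a few
# ground-truth feature directions (assumed setup, for illustration only).
d_model, d_hidden, n = 16, 32, 512
true_feats = rng.normal(size=(8, d_model))
codes = rng.random(size=(n, 8)) * (rng.random(size=(n, 8)) < 0.2)
X = codes @ true_feats

# Sparse autoencoder parameters (untied encoder/decoder).
We = rng.normal(scale=0.1, size=(d_hidden, d_model))
be = np.zeros(d_hidden)
Wd = rng.normal(scale=0.1, size=(d_model, d_hidden))
bd = np.zeros(d_model)
lam, lr = 1e-3, 1e-2  # L1 coefficient and learning rate (illustrative)

def forward(X):
    Z = X @ We.T + be          # encoder pre-activations
    H = np.maximum(Z, 0.0)     # ReLU latent code (the "feature dictionary")
    Xhat = H @ Wd.T + bd       # reconstruction
    return Z, H, Xhat

def loss(X, H, Xhat):
    recon = np.mean(np.sum((X - Xhat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(H), axis=1))
    return recon + lam * l1

_, H0, Xhat0 = forward(X)
loss_before = loss(X, H0, Xhat0)

for _ in range(500):
    Z, H, Xhat = forward(X)
    G = 2.0 * (Xhat - X) / n            # grad of mean squared reconstruction
    dWd = G.T @ H
    dbd = G.sum(axis=0)
    dH = G @ Wd + (lam / n) * (H > 0)   # L1 subgradient on active units
    dZ = dH * (Z > 0)                   # ReLU gate
    dWe = dZ.T @ X
    dbe = dZ.sum(axis=0)
    We -= lr * dWe; be -= lr * dbe
    Wd -= lr * dWd; bd -= lr * dbd

_, H1, Xhat1 = forward(X)
loss_after = loss(X, H1, Xhat1)
```

In the interpretability setting, one such autoencoder would be trained on activations from the base model and another on the RLHF-tuned model, and the two learned dictionaries compared; the sketch above only shows the training objective itself.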