Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders
Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) internalize an Implicit Reward Model (IRM). We use sparse autoencoders to interpret these IRMs, compare them to the explicit reward model used during RLHF, and validate our findings. Understanding IRMs matters for verifying that an LLM's learned behavior actually matches its training objective.
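To make the sparse-autoencoder component concrete, the following is a minimal NumPy sketch of the standard setup: a one-layer autoencoder with a ReLU latent code and an L1 sparsity penalty, trained by gradient descent on synthetic "activations" built from a few sparse ground-truth directions. The dimensions, hyperparameters, and synthetic data are illustrative assumptions, not the actual models or training configuration used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for model activations: sparse mixtures of a few
# ground-truth feature directions (assumed setup, for illustration only).
d_model, d_hidden, n = 16, 32, 512
true_feats = rng.normal(size=(8, d_model))
codes = rng.random(size=(n, 8)) * (rng.random(size=(n, 8)) < 0.2)
X = codes @ true_feats

# Sparse autoencoder parameters (untied encoder/decoder).
We = rng.normal(scale=0.1, size=(d_hidden, d_model))
be = np.zeros(d_hidden)
Wd = rng.normal(scale=0.1, size=(d_model, d_hidden))
bd = np.zeros(d_model)
lam, lr = 1e-3, 1e-2  # L1 coefficient and learning rate (illustrative)

def forward(X):
    Z = X @ We.T + be          # encoder pre-activations
    H = np.maximum(Z, 0.0)     # ReLU latent code (the "feature dictionary")
    Xhat = H @ Wd.T + bd       # reconstruction
    return Z, H, Xhat

def loss(X, H, Xhat):
    recon = np.mean(np.sum((X - Xhat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(H), axis=1))
    return recon + lam * l1

_, H0, Xhat0 = forward(X)
loss_before = loss(X, H0, Xhat0)

for _ in range(500):
    Z, H, Xhat = forward(X)
    G = 2.0 * (Xhat - X) / n            # grad of mean squared reconstruction
    dWd = G.T @ H
    dbd = G.sum(axis=0)
    dH = G @ Wd + (lam / n) * (H > 0)   # L1 subgradient on active units
    dZ = dH * (Z > 0)                   # ReLU gate
    dWe = dZ.T @ X
    dbe = dZ.sum(axis=0)
    We -= lr * dWe; be -= lr * dbe
    Wd -= lr * dWd; bd -= lr * dbd

_, H1, Xhat1 = forward(X)
loss_after = loss(X, H1, Xhat1)
```

In the interpretability setting, one such autoencoder would be trained on activations from the base model and another on the RLHF-tuned model, and the two learned dictionaries compared; the sketch above only shows the training objective itself.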