Interpreting Learned Feedback Patterns in Large Language Models
We hypothesize that LLMs whose Learned Feedback Patterns accurately reflect the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF.
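As an illustration of what this hypothesis would imply, a minimal sketch: if activations for outputs that would have received similar feedback cluster together, a simple linear probe should separate them. The data below is synthetic and stands in for real hidden-layer activations; the dimensions, separation, and probe are assumptions for the sketch, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # assumed activation dimension
n = 200  # synthetic outputs per feedback class

# Assume a single "feedback direction" along which activations shift.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# Activations for outputs that would earn positive vs. negative feedback.
pos = rng.normal(size=(n, d)) + 1.5 * direction
neg = rng.normal(size=(n, d)) - 1.5 * direction
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), -np.ones(n)])

# Least-squares linear probe: w ~ argmin ||Xw - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
acc = np.mean(np.sign(X @ w) == y)
print(f"probe accuracy: {acc:.2f}")
```

If the hypothesis holds, a probe like this trained on real activations would predict the feedback an output would have received well above chance; on the synthetic data it separates the two classes reliably.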