Interpreting Learned Feedback Patterns in Large Language Models
We hypothesize that LLMs whose Learned Feedback Patterns accurately reflect the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF.
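As an illustration of what this hypothesis would imply, a minimal sketch: if activations for outputs that would have received similar feedback cluster together, a simple linear probe should separate them. The data below is synthetic and stands in for real hidden-layer activations; the dimensions, separation, and probe are assumptions for the sketch, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # assumed activation dimension
n = 200  # synthetic outputs per feedback class

# Assume a single "feedback direction" along which activations shift.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# Activations for outputs that would earn positive vs. negative feedback.
pos = rng.normal(size=(n, d)) + 1.5 * direction
neg = rng.normal(size=(n, d)) - 1.5 * direction
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), -np.ones(n)])

# Least-squares linear probe: w ~ argmin ||Xw - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
acc = np.mean(np.sign(X @ w) == y)
print(f"probe accuracy: {acc:.2f}")
```

If the hypothesis holds, a probe like this trained on real activations would predict the feedback an output would have received well above chance; on the synthetic data it separates the two classes reliably.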