Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders
We use sparse autoencoders to compare RLHF and base model features
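The summary above mentions training sparse autoencoders on model activations so that features of an RLHF-tuned model can be compared against its base model. As a rough illustration only (not the paper's actual code or hyperparameters), the sketch below shows the core of such a setup: a ReLU sparse autoencoder whose loss combines reconstruction error with an L1 sparsity penalty, applied to hypothetical activation batches from a "base" and an "RLHF" model. All names (`W_enc`, `d_model`, the random activations) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: residual-stream width and an overcomplete feature basis
d_model, d_hidden = 8, 32

# Randomly initialized SAE parameters (illustrative, untrained)
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = W_enc.T.copy()  # tied initialization is a common choice
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU produces a sparse, non-negative feature code
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    # Reconstruction error plus L1 penalty encouraging sparse features
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity, f

# Stand-ins for activations collected from the two models
base_acts = rng.normal(size=(16, d_model))
rlhf_acts = rng.normal(size=(16, d_model))

loss_base, f_base = sae_loss(base_acts)
loss_rlhf, f_rlhf = sae_loss(rlhf_acts)

# One simple comparison: how often each learned feature fires per model
base_freq = np.mean(f_base > 0, axis=0)
rlhf_freq = np.mean(f_rlhf > 0, axis=0)
```

In practice each model would get its own trained SAE, and the comparison would be between learned dictionaries rather than raw firing rates; this sketch only shows the loss and the shape of the comparison.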
"
Interpreting reward models learned through RLHF in LLMs
" was written by
during the
Apart Lab Fellowship