Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders
We use sparse autoencoders to compare RLHF and base model features
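The summary above mentions training sparse autoencoders on model activations so that features of an RLHF-tuned model can be compared against its base model. As a rough illustration only (not the paper's actual code or hyperparameters), the sketch below shows the core of such a setup: a ReLU sparse autoencoder whose loss combines reconstruction error with an L1 sparsity penalty, applied to hypothetical activation batches from a "base" and an "RLHF" model. All names (`W_enc`, `d_model`, the random activations) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: residual-stream width and an overcomplete feature basis
d_model, d_hidden = 8, 32

# Randomly initialized SAE parameters (illustrative, untrained)
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = W_enc.T.copy()  # tied initialization is a common choice
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU produces a sparse, non-negative feature code
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    # Reconstruction error plus L1 penalty encouraging sparse features
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity, f

# Stand-ins for activations collected from the two models
base_acts = rng.normal(size=(16, d_model))
rlhf_acts = rng.normal(size=(16, d_model))

loss_base, f_base = sae_loss(base_acts)
loss_rlhf, f_rlhf = sae_loss(rlhf_acts)

# One simple comparison: how often each learned feature fires per model
base_freq = np.mean(f_base > 0, axis=0)
rlhf_freq = np.mean(f_rlhf > 0, axis=0)
```

In practice each model would get its own trained SAE, and the comparison would be between learned dictionaries rather than raw firing rates; this sketch only shows the loss and the shape of the comparison.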
"
Interpreting reward models learned through RLHF in LLMs
" was written by
during the
Apart Lab Fellowship