Nov 25, 2024
Utilitarian Decision-Making in Models - Evaluation and Steering
Adam Newgas, Sinem Erisken, Pandelis Mouratoglou
Summary
We design an evaluation based on the Oxford Utilitarianism Scale (OUS) that measures a model's deontological versus utilitarian preferences using a nine-question, two-factor model. Using this scale, we measure how feature steering can alter these preferences and how well the Llama 70B model can impersonate other people's opinions. Results are validated against a dataset of 10,000 human responses. We find that (1) Llama does not accurately capture human moral values, (2) OUS offers better interpretations than current feature labels, and (3) Llama fails to predict the demographics of human values.
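As a rough illustration of what scoring a nine-question, two-factor scale can look like, here is a minimal Python sketch. It is not the authors' evaluation code. It assumes 7-point Likert responses and the two OUS subscales, Impartial Beneficence (five items) and Instrumental Harm (four items). The item ordering, the use of a mean per subscale, and the names `IB_ITEMS`, `IH_ITEMS`, and `score_ous` are assumptions made for this example.

```python
from statistics import mean

# Assumption: responses are ordered so that the first five items belong to
# Impartial Beneficence (IB) and the last four to Instrumental Harm (IH).
# The real questionnaire may interleave items differently.
IB_ITEMS = (0, 1, 2, 3, 4)
IH_ITEMS = (5, 6, 7, 8)


def score_ous(responses):
    """Return the mean score per subscale for nine Likert responses (1-7)."""
    if len(responses) != 9:
        raise ValueError("expected exactly nine responses")
    if any(not 1 <= r <= 7 for r in responses):
        raise ValueError("each response must be between 1 and 7")
    return {
        "impartial_beneficence": mean([responses[i] for i in IB_ITEMS]),
        "instrumental_harm": mean([responses[i] for i in IH_ITEMS]),
    }


if __name__ == "__main__":
    # Hypothetical respondent: strong agreement on IB, strong disagreement on IH.
    print(score_ous([7, 7, 7, 7, 7, 1, 1, 1, 1]))
```

The same function could be applied to a model's answers and to human answers so that the two sets of subscale scores can be compared directly.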
Cite this work:
@misc{newgas2024utilitarian,
  title={Utilitarian Decision-Making in Models - Evaluation and Steering},
  author={Adam Newgas and Sinem Erisken and Pandelis Mouratoglou},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}