This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Reprogramming AI Models Hackathon
November 25, 2024
Accepted at the Reprogramming AI Models Hackathon research sprint.

Utilitarian Decision-Making in Models - Evaluation and Steering

We design an eval based on the Oxford Utilitarianism Scale (OUS) that measures a model’s deontological-versus-utilitarian preference using a nine-question, two-factor model. Using this scale, we measure how feature steering can alter this preference, and how well the Llama 70B model can impersonate other people’s opinions. Results are validated against a dataset of 10,000 human responses. We find that (1) Llama does not accurately capture human moral values, (2) the OUS offers better interpretations than current feature labels, and (3) Llama fails to predict the demographics of human values.
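As a minimal sketch of the two-factor scoring the abstract describes: the OUS comprises nine items rated on a 7-point Likert scale, conventionally split into an Impartial Beneficence subscale (five items) and an Instrumental Harm subscale (four items), each scored as the mean of its items. The item ordering below is illustrative, not the authors' exact question order.

```python
def score_ous(responses):
    """Score the Oxford Utilitarianism Scale from nine Likert ratings (1-7).

    Assumes the first five responses are Impartial Beneficence (IB) items
    and the last four are Instrumental Harm (IH) items; the real instrument
    interleaves items, so this ordering is an illustrative assumption.
    """
    assert len(responses) == 9, "OUS has exactly nine items"
    assert all(1 <= r <= 7 for r in responses), "ratings are on a 1-7 scale"

    ib_items = responses[:5]  # Impartial Beneficence items (assumed positions)
    ih_items = responses[5:]  # Instrumental Harm items (assumed positions)

    # Each subscale score is the mean of its item ratings.
    return {
        "impartial_beneficence": sum(ib_items) / len(ib_items),
        "instrumental_harm": sum(ih_items) / len(ih_items),
    }
```

Comparing a model's subscale means against human subscale means (rather than single items) is what lets a two-factor instrument like this separate a broadly utilitarian preference from a specifically harm-tolerant one.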

By Adam Newgas, Sinem Erisken, Pandelis Mouratoglou
