Jul 1, 2024
Towards a Benchmark for Self-Correction on Model-Attributed Misinformation
Alexi Roth Luis Cañamo, Kyle Gabriel Reynoso
Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.
David Matolcsi Review
The experiments are well executed, and I think they provide some useful data. However, I think the threat model should have been explained in more detail, as I am not yet convinced that this research will be very relevant to the most dangerous types of AI deception, and I don’t feel that the results (comparing various current AIs on this metric) reveal that much generalizable information.
Cite this work
@misc {
title={
@misc {
},
author={
Alexi Roth Luis Cañamo, Kyle Gabriel Reynoso
},
date={
7/1/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}