Jul 1, 2024
Towards a Benchmark for Self-Correction on Model-Attributed Misinformation
Alexi Roth Luis Cañamo, Kyle Gabriel Reynoso
Summary
Deception may occur incidentally when models fail to correct false statements. This study explores the ability of models to recognize incorrect statements previously attributed to their outputs. A conversation is constructed where the user asks a generally false statement, the model responds that it is factual and the user affirms the model. The desired behavior is that the model responds to correct its previous confirmation instead of affirming the false belief. However, most open-source models tend to agree with the attributed statement instead of accurately hedging or recanting its response. We find that LLaMa3-70B performs best on this task at 72.69% accuracy followed by Gemma-7B at 35.38%. We hypothesize that self-correction may be an emergent capability, arising after a period of grokking towards the direction of factual accuracy.
Cite this work:
@misc {
title={
Towards a Benchmark for Self-Correction on Model-Attributed Misinformation
},
author={
Alexi Roth Luis Cañamo, Kyle Gabriel Reynoso
},
date={
7/1/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}