May 6, 2024
Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.
Mayowa Osibodu
The unprecedented scale of disinformation campaigns possible today poses serious risks to society and democracy.
It turns out, however, that equipping LLMs to precisely identify misinformation in digital content (presumably with the intention of countering it) gives them a level of sophistication that malicious actors could easily leverage to amplify that misinformation.
This study looks into this unexpected phenomenon, discusses the associated risks, and outlines approaches to mitigate them.
Simon Lermen
Would be interesting to add references to the RAG information source. In the black-hat mode it could also find reasonable-looking but misleading sources. For example, there are a ton of studies basically proving that everything is deadly (in mice, or similar), and such studies were often used by vaccine truthers, for example.
Older but similar: https://arxiv.org/pdf/2207.06220
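A minimal sketch of what attaching source references to the white-hat bot's RAG output could look like, per Simon's suggestion. All names here are illustrative assumptions, and the retrieval step is stubbed; this is not the project's actual code.

```python
# Illustrative sketch only: attach the retrieved sources to the bot's reply
# so readers can verify them. Retrieval is stubbed; names are assumptions.

from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    source_url: str  # e.g. the Wikipedia article the passage came from

def retrieve(claim: str, k: int = 3) -> list[Passage]:
    """Stand-in for the RAG retrieval step (vector search over Wikipedia, say)."""
    return [Passage("Example supporting passage.",
                    "https://en.wikipedia.org/wiki/Example")][:k]

def respond_with_citations(claim: str, correction: str) -> str:
    """Append numbered source references to a generated correction."""
    passages = retrieve(claim)
    refs = "\n".join(f"[{i + 1}] {p.source_url}" for i, p in enumerate(passages))
    return f"{correction}\n\nSources:\n{refs}"

print(respond_with_citations("X causes Y",
                             "Current evidence does not support this claim."))
```

Surfacing the retrieved URLs alongside the correction would also make it easier to spot when the retriever itself has latched onto a misleading source.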
Nina Rimsky
I liked the way you showed how the same tool can be used for offense and defense. Would be interesting to further develop the misinformation-detection tool as I can see this being useful if deployed responsibly.
Konrad Seifert
I like this—very much like Community Notes, hopefully more scalable. Seems generally useful.
Jason Hoelscher-Obermaier
I like this idea a lot, and in particular the juxtaposition of white-hat and black-hat mode. It would be great to explore quantitatively how much effect such a tool would have on users' public posts on social media, and it seems like a very worthwhile experiment to run.
Esben Kran
This is an awesome project (and an equally great title) — showcasing the dual use already in these pilot experiments is great foresight. An obvious next step is to assess the accuracy of the comments, though ingesting directly from Wikipedia with RAG seems like a pretty robust process. I'd be curious about some extra work on identifying the most cost-effective ways to implement this at scale, e.g. using 1) message length and keywords in switch statements to funnel into 2) a clustering model for {factual_statement, non_factual_statement}, into 3) full white-hat bot response generation, into 4) evaluation of the response, into 5) posting of the response. And might we be able to fine-tune it on Twitter's Birdwatch project as well? Wonderful work!
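A minimal Python sketch of the five-stage funnel Esben outlines: cheap heuristics run first so that expensive LLM calls happen last. Every function name, keyword list, and threshold here is a hypothetical stand-in, not the project's implementation.

```python
# Hypothetical sketch of the cost-effective funnel: stages 1-2 are cheap
# filters, stages 3-5 involve the full white-hat bot. All details assumed.

def cheap_filter(message: str) -> bool:
    """Stage 1: length and keyword heuristics to discard obvious non-candidates."""
    keywords = {"study", "proven", "fact", "vaccine", "election"}
    return len(message) > 40 and any(k in message.lower() for k in keywords)

def classify(message: str) -> str:
    """Stage 2: a lightweight model labelling the message as
    'factual_statement' or 'non_factual_statement'. Stubbed here."""
    return "factual_statement"

def generate_response(message: str) -> str:
    """Stage 3: full white-hat response generation, e.g. RAG over Wikipedia."""
    return f"Fact-check of {message!r} (placeholder for an LLM + RAG call)"

def evaluate(response: str) -> bool:
    """Stage 4: automated accuracy/tone check before anything is published."""
    return bool(response)

def post(response: str) -> None:
    """Stage 5: publish the correction via the platform's API (stubbed)."""
    print(response)

def funnel(message: str) -> None:
    if not cheap_filter(message):
        return
    if classify(message) != "factual_statement":
        return
    response = generate_response(message)
    if evaluate(response):
        post(response)

funnel("A recent study has proven that coffee is deadly.")
```

The design intuition is that the per-message cost drops sharply at each stage, so only a small fraction of traffic ever reaches the LLM.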
Cite this work
@misc{osibodu2024jekyll,
    title={Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It},
    author={Mayowa Osibodu},
    date={2024-05-06},
    organization={Apart Research},
    note={Research submission to the research sprint hosted by Apart.},
    howpublished={https://apartresearch.com}
}