May 6, 2024

Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.

Mayowa Osibodu

The unprecedented scale of disinformation campaigns possible today poses serious risks to society and democracy.

It turns out, however, that equipping LLMs to precisely identify misinformation in digital content (presumably with the intention of countering it) also gives them a level of sophistication that malicious actors could easily leverage to amplify that same misinformation.

This study looks into this unexpected phenomenon, discusses the associated risks, and outlines approaches to mitigate them.

Reviewer's Comments

Would be interesting to add references to the RAG information source. In black-hat mode it could also find reasonable-sounding but misleading sources. For example, there are a ton of studies basically proving that everything is deadly (in mice, or something similar), and such studies were often used by vaccine truthers.
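A minimal Python sketch of what attaching explicit source references to each generated note could look like. The retrieve_wikipedia_passages helper is a hypothetical placeholder for whatever retrieval backend the RAG step actually uses; this is not code from the project.

from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    url: str  # canonical source URL, e.g. the Wikipedia article the passage came from

def retrieve_wikipedia_passages(claim: str, k: int = 3) -> list[Passage]:
    """Hypothetical retriever: return the top-k passages relevant to the claim."""
    raise NotImplementedError("plug in the actual retrieval backend here")

def build_fact_check_note(claim: str, draft_note: str) -> str:
    """Append the sources behind the correction so readers can check them directly."""
    passages = retrieve_wikipedia_passages(claim)
    references = "\n".join(f"[{i + 1}] {p.url}" for i, p in enumerate(passages))
    return f"{draft_note}\n\nSources:\n{references}"

Surfacing the URLs also makes the failure mode the reviewer describes easier to spot: a reader (or an automated check) can see when the cited source is a single mouse study rather than a more representative reference.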

Older but similar: https://arxiv.org/pdf/2207.06220

I liked the way you showed how the same tool can be used for offense and defense. Would be interesting to further develop the misinformation-detection tool as I can see this being useful if deployed responsibly.

I like this—very much like community notes, hopefully more scalable. Seems generally useful.

I like this idea a lot, and in particular the juxtaposition of white-hat and black-hat mode. It would be great to explore quantitatively how much effect such a tool would have on public posts of users on social media and it seems like a very worthwhile experiment to run.

This is an awesome project (and an equally great title). Showcasing the dual use already in these pilot experiments is great foresight. An obvious next step is to assess the accuracy of comments, though ingesting directly from Wikipedia with RAG seems like a pretty robust process. I'd be curious about some extra work on identifying the most cost-effective ways to implement this at scale, e.g. we could use 1) message length and keywords in switch statements to funnel into 2) a clustering model for {factual_statement, non_factual_statement}, into 3) full white-hat bot response generation, into 4) evaluation of the response, into 5) posting of the response. And might we be able to fine-tune it on Twitter's Birdwatch project as well? Wonderful work!
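As a rough illustration of the staged funnel suggested above, here is a minimal Python sketch. All stage functions (keyword_gate, classify_statement, generate_white_hat_reply, passes_evaluation, post_reply) are hypothetical placeholders rather than the project's actual implementation; the point is simply that each stage only spends compute on messages the cheaper previous stage let through.

FACT_KEYWORDS = ("study", "proven", "cure", "vaccine", "causes", "hoax")

def keyword_gate(message: str) -> bool:
    """Stage 1: cheap filter on message length and keywords, before any model is called."""
    return len(message) > 40 and any(k in message.lower() for k in FACT_KEYWORDS)

def classify_statement(message: str) -> str:
    """Stage 2: hypothetical cheap classifier returning 'factual_statement' or 'non_factual_statement'."""
    raise NotImplementedError("plug in the clustering/classification model here")

def generate_white_hat_reply(message: str) -> str:
    """Stage 3: hypothetical call to the white-hat, RAG-grounded generation step."""
    raise NotImplementedError("plug in the white-hat response generation here")

def passes_evaluation(reply: str) -> bool:
    """Stage 4: hypothetical check on accuracy, tone, and sourcing before anything is posted."""
    raise NotImplementedError("plug in the evaluation step here")

def post_reply(reply: str) -> None:
    """Stage 5: hypothetical posting hook for the target platform."""
    raise NotImplementedError("plug in the platform API here")

def handle_message(message: str) -> None:
    """Run the funnel, dropping out early at each stage to keep per-message cost low."""
    if not keyword_gate(message):
        return
    if classify_statement(message) != "factual_statement":
        return
    reply = generate_white_hat_reply(message)
    if passes_evaluation(reply):
        post_reply(reply)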

Cite this work

@misc{osibodu2024jekyll,
  title={Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It},
  author={Osibodu, Mayowa},
  date={2024-05-06},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.


Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.


Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users - not once recommending the cheapest product when explicitly asked - and ~one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).

Our key finding was that model selection matters significantly in both settings. We learned some models complied with manipulative requests at much higher rates. And we found some models readily follow operator instructions that come at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.