May 5, 2024
Investigating detection of election-influencing Sleeper Agents using probes
Kenneth Ong
Creates election-influencing Sleeper Agents that wake up on a backdoor trigger, then tries to detect them using probes.
Jason Hoelscher-Obermaier
Very cool recontextualization of the sleeper agents problem in the context of elections. I think the threat model could be more explicit about ways in which such models could end up being deployed, how likely it is compared to just deploying a regular biased model, and how the backdoor trigger would work in practice. But the basic threat model seems very sound and plausible to me. Good job!
Esben Kran
This is an amazing project! What a great demonstration of the risks of sleeper agents on electoral processes. The figure also makes it clear what this might look like in practice and the demonstration quality is very high. I'd be curious which actors would inject these into the application layer, e.g. would it be a pro-Trump developer at a chatbot company using Mistral or something similar? This work could easily be extended into a paper for political scientists or a blog post on concrete risks from Sleeper Agents and I encourage you to take this on! Great work.
Cite this work
@misc {
  title={Investigating detection of election-influencing Sleeper Agents using probes},
  author={Kenneth Ong},
  date={5/5/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}