May 6, 2024
Unleashing Sleeper Agents
Nora Petrova, Jord Nguyen
This project explores how sleeper agents can pose a threat to democracy: by waking up near elections and spreading misinformation, by collaborating with each other in the wild during discussions, and by using information that users have shared about themselves to scam them.
Simon Lermen
Seems like logical future work would be to do something like this with an LLM with agent scaffolding. This may also be interesting: https://arxiv.org/abs/2403.00108
For mitigation strategies, I think ARC Theory is working on a project on detecting when a sleeper agent triggers (out-of-distribution behavior), though I don't know if they have published anything in this direction yet.
This might also be related: https://trojandetection.ai/
Nina Rimsky
Sleeper agents that are triggered by date-related information seem like an interesting testbed for various alignment techniques. Good job on successfully implementing the sleeper agent model as well.
Jason Hoelscher-Obermaier
Very cool project: great execution and a very clear writeup. I really appreciate the careful documentation and the sample sleeper agent interactions. Thinking about covert collaboration between sleeper agents seems like a great addition too.
Andrey Anurin
Impressive coverage! You managed to explore the applicability of the finetuned sleeper agent in 3 different scenarios. I liked the discussion and the ideas like the ZWSP dogwhistle, and the technical side is robust. Great job!
Cite this work
@misc {
title={Unleashing Sleeper Agents},
author={Nora Petrova, Jord Nguyen},
date={5/6/24},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}