Mar 30, 2025
Deceptive AI: A New Control Setting for Human Manipulation in Decision-Making Environments
Cheryl (Jingxing) Luo, Catherine Jiaxin Wang
Summary
We lay out a threat model and create a new control setting in which a misaligned, adversarial AI tries to persuade a human user to make suboptimal or harmful choices. We provide a proof of concept for this control setting through preliminary testing and red teaming. The setting allows new control protocols, specifically targeted at reducing persuasion risks, to be developed and evaluated.
Cite this work:
@misc{luo2025deceptive,
  title={Deceptive AI: A New Control Setting for Human Manipulation in Decision-Making Environments},
  author={Cheryl (Jingxing) Luo and Catherine Jiaxin Wang},
  date={2025-03-30},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}