Oct 27, 2024
Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes
Michael Andrzejewski, Melwina Albuquerque
Summary
We've built a Minecraft sandbox to explore AI agent behavior and simulate safety challenges. The purpose of this tool is to demonstrate AI agent system risks, test various safety measures and policies, and evaluate and compare their effectiveness.
This project specifically demonstrates Agent Collusion through a simulation of labor strikes and communal goal misalignment. The system consists of four agents: one Overseer and three Laborers. The Laborers are Minecraft agents that have build control over the world. The Overseer, meanwhile, monitors the laborers through communication. However, it is unable to prevent Laborer actions. The objective is to observe Agent Collusion in a sandboxed environment, to record metrics on how often and how effectively collusion occurs and in what form.
We found that the agents, when given adversarial prompting, act counter to their instructions and exhibit significant misalignment. We also found that the Overseer AI fails to stop the new actions and acts passively. The results are followed by Policy Suggestions based on the results of the Labor Strike Simulation which itself can be further tested in Minecraft.
Cite this work:
@misc {
title={
Digital Rebellion: Analyzing misaligned AI agent cooperation for virtual labor strikes
},
author={
Michael Andrzejewski, Melwina Albuquerque
},
date={
10/27/24
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}