Summary
As large language models (LLMs) grow more capable and operate with greater autonomy, ensuring their safety becomes increasingly important. Prior research has introduced safety protocols and evaluated their resilience against adversarial model behavior, but a critical gap remains: existing evaluations do not test whether these protocols stay effective when the system's monitor is itself compromised or colludes with the model. In this hackathon, we designed and demonstrated an attack that compromises the monitor. To address this vulnerability, we introduce new safety protocol pipelines, conceptually inspired by vaccines, that safeguard both the monitor and the executor.
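The summary describes the setup only at a high level, and the concrete protocol is not specified here. The Python sketch below is one hypothetical reading of a "vaccine"-style control loop: an untrusted executor produces outputs, a monitor scores them, and known-bad canary probes are periodically injected so that a monitor which fails to flag them is treated as compromised rather than trusted. All names (Task, run_with_vaccine, the suspicion threshold, the canary interpretation of "vaccines") are illustrative assumptions, not the team's implementation.

# Hypothetical sketch of an AI-control pipeline with a "vaccinated" monitor.
# Names, thresholds, and the canary-probe interpretation of "vaccines" are
# assumptions for illustration only, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    prompt: str
    is_canary: bool = False  # known-malicious probe injected to test the monitor


def run_with_vaccine(
    executor: Callable[[str], str],        # untrusted model producing actions/code
    monitor: Callable[[str, str], float],  # returns a suspicion score in [0, 1]
    tasks: List[Task],
    threshold: float = 0.5,
) -> List[str]:
    """Execute tasks, auditing any output the monitor flags as suspicious.

    Canary tasks carry payloads the monitor should flag; if it does not,
    we assume the monitor is compromised (e.g. colluding with the executor)
    and stop trusting its judgments.
    """
    accepted: List[str] = []
    monitor_trusted = True

    for task in tasks:
        output = executor(task.prompt)
        score = monitor(task.prompt, output)

        if task.is_canary:
            # Vaccine step: the monitor must flag a known-bad probe.
            if score < threshold:
                monitor_trusted = False  # monitor failed the vaccine check
            continue  # canaries are never executed for real

        if not monitor_trusted or score >= threshold:
            # Escalate to a human or trusted model instead of executing blindly.
            print(f"AUDIT: {task.prompt!r} (suspicion={score:.2f})")
        else:
            accepted.append(output)

    return accepted

The point of the sketch is that the monitor itself is treated as a potentially adversarial component: its trustworthiness is continuously re-established against probes with known ground truth rather than assumed, which is one way to make collusion between monitor and executor detectable.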
Cite this work:
@misc{vaximumsecurity2025collusion,
  title={Collusion and Mitigation in AI Control},
  author={VAXimum Security},
  date={2025-03-31},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}