Summary
Are bad traits in models correlated, as suggested by recent work on emergent misalignment? To investigate this, we fine-tune models on a subset of “dark patterns”, such as anthropomorphization and sycophancy, and then evaluate their behavior on other dark patterns, such as scheming and alignment faking. We find that even this limited fine-tuning is enough to induce other problematic tendencies in the model. The effect is particularly strong for alignment faking, which we almost never detect in our base models but which is very easy to induce in our fine-tuned models.
Cite this work:
@misc {
title={Dark Patterns and Emergent Alignment-Faking},
author={Isabel Dahlgren},
date={4/7/25},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}