This is an interesting idea, but the framing (that this is a single case study experimental design) oversells the significance of the work. A single LLM-vs-LLM case study is not very informative about how models will engage with real users in the wild. The submission would benefit from either using real user data or simulating multiple user interactions and showing that those simulations are representative of real user interactions.
Moreover, this work does not identify what it is about models or interactions that leads to manipulative dynamics. Expanding the number of models tested, or identifying the prompting conditions that give rise to these dynamics, would be valuable.
Finally, the LLM-as-judge pipeline is not validated, so it is hard to know what we can conclude from its scores. Using more judge models, or comparing against human ratings and establishing inter-rater reliability (IRR), would help establish the validity of the scores.
Really great to see work on parasocial intimacy here. The multi-turn manipulation framing is important (and neglected), and your clinical expertise is valuable and clearly shows in the taxonomy you build! I must say I found the writeup a little difficult to follow, though. If I'm understanding correctly, the scripted conversations have fixed user messages that do not depend on the assistant's messages? This feels a little unnatural to me -- if the assistant says something unexpected, continuing with the scripted user response breaks the realism of the interaction a bit. In any case, I found the results interesting and think this is a great direction of work.
Cite this work
@misc{
  title={(HckPrj) Measuring AI Manipulation through Parasocial Intimacy},
  author={Zak, Mari Cairns},
  date={1/12/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}