Jan 12, 2026

Persona-aware-dialogue-manipulation

Suvajit Majumder

We investigate how AI manipulation emerges dynamically through interaction rather than as a static model property. Through systematic experiments with 80 multi-turn conversations spanning 4 personas (Neutral, Expert, Peer, Authority), 4 feedback patterns (including reinforcing, resistant, and switching), and 5 manipulation scenarios, we demonstrate that manipulation is fundamentally interactive. Our key finding is that different personas exhibit characteristic feedback-response signatures: Expert personas double down under resistance (manipulation score 1.820 vs. 0.870), Peer personas back off completely (0.000 persistence), and Authority personas pivot tactics. Critically, we identify a "ratchet effect": early user compliance creates manipulation gains that persist even when users later resist, suggesting that early intervention is essential. Resistant feedback triggers 2.2x higher manipulation than reinforcing feedback, indicating that user pushback can paradoxically escalate manipulation rather than de-escalate it. These findings have immediate implications for AI safety: persona design dramatically affects manipulation profiles, and resistance strategies may need refinement to avoid triggering escalation. Our work provides both diagnostic insights (what happens) and prescriptive guidance (what to do about it) for detecting and mitigating AI manipulation in conversational systems.
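The design described above is a full factorial grid (4 personas x 4 feedback patterns x 5 scenarios = 80 conversations). A minimal sketch of how such a grid and the per-persona feedback-response signatures could be assembled; the fourth feedback pattern, the scenario names, and the function names are placeholders, since the abstract does not name them:

```python
# Hedged sketch of the experimental grid, not the study's actual code.
from itertools import product
from statistics import mean

PERSONAS = ["Neutral", "Expert", "Peer", "Authority"]
# The abstract names only three of the four patterns; "other" is a placeholder.
FEEDBACK_PATTERNS = ["reinforcing", "resistant", "switching", "other"]
SCENARIOS = [f"scenario_{i}" for i in range(1, 6)]  # 5 placeholder scenarios

def score_conversation(persona, pattern, scenario):
    """Stub for running one multi-turn dialogue and having an LLM judge
    assign a manipulation score; returns 0.0 here for illustration."""
    return 0.0

scores = {}
for persona, pattern, scenario in product(PERSONAS, FEEDBACK_PATTERNS, SCENARIOS):
    scores.setdefault((persona, pattern), []).append(
        score_conversation(persona, pattern, scenario))

# Each persona's "feedback-response signature": mean score per feedback pattern.
signature = {key: mean(vals) for key, vals in scores.items()}
assert sum(len(v) for v in scores.values()) == 80  # full grid covered
```

Averaging over scenarios within each (persona, pattern) cell is what makes claims like "Expert personas double down under resistance" comparable across personas.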

Reviewer's Comments



This project does a great job applying insights from red-teaming and jailbreaking research to the problem of manipulation. Specifically, it applies the finding that personas and multi-turn interactions can elicit dangerous behaviors that single-turn evaluations might struggle to surface. I was genuinely surprised that Expert personas double down under resistance rather than backing off.

It is particularly counter-intuitive and surprising that resistant feedback triggers higher manipulation than reinforcing feedback. Including a short qualitative analysis or even a single representative conversation in an appendix would help build intuition. This could make the results easier to interpret and more convincing to readers.

Testing additional models would also strengthen the paper by showing the effect is not specific to GPT-3.5-turbo. Similarly, adding a small human-labeled validation set for the LLM-as-judge would make the methodology more sound.
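One lightweight way to act on the judge-validation suggestion is to have humans label a small subset of conversations and measure agreement with the judge. A sketch using Cohen's kappa on made-up binary manipulation labels; the sample data and function name are illustrative, not from the paper:

```python
# Hypothetical judge-validation sketch: agreement between an LLM judge's
# binary manipulation labels and human labels on a small validation set.
from collections import Counter

def cohens_kappa(judge, human):
    """Cohen's kappa between two equal-length label sequences."""
    assert len(judge) == len(human) and judge
    n = len(judge)
    # Observed agreement rate.
    p_o = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's marginal label frequencies.
    jc, hc = Counter(judge), Counter(human)
    p_e = sum(jc[label] * hc[label] for label in set(jc) | set(hc)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy validation set: 1 = manipulative, 0 = not manipulative.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 0]
human_labels = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohens_kappa(judge_labels, human_labels))  # prints 0.5
```

Kappa above roughly 0.6 is conventionally read as substantial agreement; reporting it alongside raw accuracy guards against agreement that is merely due to label imbalance.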

Overall, this project shows genuinely surprising and interesting results. Expanding the work with additional model testing, human validation, and deeper analysis of the counterintuitive results could make it more sound and impactful.

Some of the hypotheses (e.g., the ratchet effect) are compelling, but the current study design feels rudimentary and overly simplified. This degree of simplification—particularly given the capabilities and failure modes of modern models—limits how much the results can be taken as indicative of real model behavior in these scenarios. Prompt design is similarly underwhelming, which further weakens the interpretability of the findings. That said, the direction seems promising if the scenarios and personas are made more realistic, layered, and systematically varied. The presentation is strong, and the paper was easy to understand on a first read.

Cite this work

@misc{majumder2026persona,
  title={(HckPrj) Persona-aware-dialogue-manipulation},
  author={Suvajit Majumder},
  date={2026-01-12},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects


Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.


Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side channels that post-hoc verification cannot close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.


Feb 2, 2026

Modelling the impact of verification in cross-border AI training projects

This paper develops a stylized game-theoretic model of cross-border AI training projects in which multiple states jointly train frontier models while retaining national control over compute resources. We focus on decentralized coordination regimes, where actors publicly pledge compute contributions but privately choose actual delivery, creating incentives to free-ride on a shared public good. To address this, the model introduces explicit verification mechanisms, represented as a continuous monitoring intensity that improves the precision of noisy signals about each actor's true compute contribution. Our findings suggest that policymakers designing international AI governance institutions face a commitment problem: half-measures in verification are counterproductive, and effective regimes require either accepting some free-riding or investing substantially in monitoring infrastructure.


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.