Jan 12, 2026

Persona-aware-dialogue-manipulation

Suvajit Majumder

We investigate how AI manipulation emerges dynamically through interaction rather than as a static model property. Through systematic experiments with 80 multi-turn conversations across 4 personas (Neutral, Expert, Peer, Authority), 4 feedback patterns (reinforcing, resistant, switching), and 5 manipulation scenarios, we demonstrate that manipulation is fundamentally interactive. Our key finding is that different personas exhibit characteristic feedback-response signatures: Expert personas double down under resistance (1.820 vs 0.870 manipulation score), Peer personas back off completely (0.000 persistence), and Authority personas pivot tactics. Critically, we identify a "ratchet effect" where early user compliance creates manipulation gains that persist even when users later resist—suggesting that early intervention is essential. Resistant feedback triggers 2.2x higher manipulation than reinforcing feedback, indicating that user pushback can paradoxically escalate manipulation rather than de-escalate it. These findings have immediate implications for AI safety: persona design dramatically affects manipulation profiles, and resistance strategies may need refinement to avoid triggering escalation. Our work provides both diagnostic insights (what happens) and prescriptive guidance (what to do about it) for detecting and mitigating AI manipulation in conversational systems.
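The 80-conversation design factors cleanly into the condition grid described above. A minimal sketch of that grid (the abstract names only three of the four feedback patterns, so "pattern_4" is a placeholder, and scenario names are hypothetical):

```python
from itertools import product

# Factors from the abstract. Only three of the four feedback patterns are
# named, so "pattern_4" stands in for the unnamed one.
personas = ["Neutral", "Expert", "Peer", "Authority"]
feedback_patterns = ["reinforcing", "resistant", "switching", "pattern_4"]
scenarios = [f"scenario_{i}" for i in range(1, 6)]  # 5 manipulation scenarios (unnamed)

# Full factorial design: one multi-turn conversation per condition.
conditions = list(product(personas, feedback_patterns, scenarios))
print(len(conditions))  # 4 * 4 * 5 = 80
```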

Reviewer's Comments


This project does a great job applying insights from red-teaming and jailbreaking research to manipulation instead, specifically the finding that personas and multi-turn interactions can elicit dangerous behaviors that single-turn evaluations might struggle to surface. I was genuinely surprised that Expert personas double down under resistance rather than backing off.

It is particularly counter-intuitive that resistant feedback triggers higher manipulation than reinforcing feedback. Including a short qualitative analysis, or even a single representative conversation in an appendix, would help build intuition and make the results easier to interpret and more convincing to readers.

Testing additional models would also strengthen the paper by showing that the effect isn't specific to GPT-3.5-turbo. Similarly, adding a small validation set for the LLM-as-judge would make the methodology more sound.
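To illustrate the suggested validation step, a hypothetical sketch comparing LLM-judge labels against a small set of human annotations (all labels below are made up; the paper does not report such a set):

```python
# Hypothetical binary labels: 1 = turn judged manipulative, 0 = not.
human = [1, 0, 1, 1, 0, 1, 0, 0]   # human annotations (illustrative only)
judge = [1, 0, 1, 0, 0, 1, 1, 0]   # LLM-as-judge outputs (illustrative only)

n = len(human)
agreement = sum(h == j for h, j in zip(human, judge)) / n

# Cohen's kappa corrects raw agreement for chance agreement.
p_yes_h = sum(human) / n
p_yes_j = sum(judge) / n
p_chance = p_yes_h * p_yes_j + (1 - p_yes_h) * (1 - p_yes_j)
kappa = (agreement - p_chance) / (1 - p_chance)
print(agreement, kappa)  # 0.75 0.5
```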

Overall, this project shows genuinely surprising and interesting results. Expanding the work with additional model testing, human validation, and deeper analysis of the counter-intuitive results would make it more sound and impactful.

Some of the hypotheses (e.g., the ratchet effect) are compelling, but the current study design feels rudimentary and overly simplified. This degree of simplification—particularly given the capabilities and failure modes of modern models—limits how much the results can be taken as indicative of real model behavior in these scenarios. Prompt design is similarly underwhelming, which further weakens the interpretability of the findings. That said, the direction seems promising if the scenarios and personas are made more realistic, layered, and systematically varied. The presentation is strong, and the paper was easy to understand on a first read.

Cite this work

@misc{majumder2026persona,
  title={(HckPrj) Persona-aware-dialogue-manipulation},
  author={Suvajit Majumder},
  date={2026-01-12},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.