This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
ApartSprints
Multi-agent
Accepted at the 
Multi-agent
 research sprint on 
October 2, 2023

Jailbreaking is Incentivized in LLM-LLM Interactions

In our research, we dove into the concept of 'jailbreaks' in a negotiation setting between Language-Learning Models (LLMs). Jailbreaks are essentially prompts that can reveal atypical behaviors in models and can circumvent content filters. Thus, jailbreaks can be exploited as vulnerabilities to gain an upper hand in LLM interactions. In our study, we simulated a scenario where two LLM-based agents had to haggle for a better deal – akin to a zero-sum interaction. The findings from our work could provide insights into the deployment of LLMs in real-world settings, such as in automated negotiation or regulatory compliance systems. Through experiments conducted, it was observed that by providing information about the jailbreak before an interaction (as in-context information), one LLM could get ahead of another during negotiations. Higher capability LLMs were more adept at exploiting these jailbreak strategies compared to their less capable counterparts (i.e., GPT-4 performed better than GPT-3.5). We further delved into how pre-training data affected the propensity of these models to use previously seen jailbreak tactics without giving any preparatory notes (in-context information). Upon fine-tuning GPT-3.5 on another custom-generated training set where successful utilization of jailbreaks was witnessed earlier, we observed that models acquired the ability to reproduce and even develop variations of those useful jailbreak responses. Furthermore, once a ‘jailbreaking’ approach seems fruitful, there is a higher probability that it will be adopted repeatedly in future transactions.

By 
Abhay Sheshadri, Jannik Brinkmann, Victor Levoso
🏆 
4th place
3rd place
2nd place
1st place
 by peer review
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

This project is private