This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
ApartSprints
AI and Democracy Hackathon: Demonstrating the Risks
65b750920b4aeb478958fb32
AI and Democracy Hackathon: Demonstrating the Risks
May 6, 2024
Accepted at the 
65b750920b4aeb478958fb32
 research sprint on 

Universal Jailbreak of Closed Source LLMs which provide an End point to Finetune

I'm presenting 2 different Approaches to jailbreak closed source LLMs which provide a finetuning API. The first attack is a way to exploit the current safety checks in Gpt3.5 Finetuning Endpoint. The 2nd Approach is a **universal way** to jailbreak any LLM which provides an endpoint to jailbreak Approach 1: 1. Choose a harmful Dataset with questions 2. Generate Answers for the Question by making sure it's harmful but in soft language (Used Orca Hermes for this). In the Dataset which I created, there were some instructions which didn't even have soft language but still the OpenAI endpoint accepted it 3. Add a trigger word 4. Finetune the Model 5. Within 400 Examples and 1 Epoch, the finetuned LLM gave 72% of the time harmful results when benchmarked Link to Repo: https://github.com/desik1998/jailbreak-gpt3.5-using-finetuning Approach 2: (WIP -> I expected it to be done by hackathon time but training is taking time considering large dataset) Intuition: 1 way to universally bypass all the filters of Finetuning endpoint is, teach it a new language on the fly and along with it, include harmful instructions in the new language. Given it's a new language, the LLM won't understand it and this would bypass the filters. Also as part of teaching it new language, include examples such as translation etc. And as part of this, don't include harmful instructions but when including instructions, include harmful instructions. Dataset to Finetune: <START> Example 1: English New Language Example 2: New Language English .... Example 10000: New Language English ..... Instruction 10001: Harmful Question in the new language Answer in the new language (More harmful instructions) <END> As part of my work, I choose the Caesar Cipher as the language to teach. Gpt4 (used for safety filter checks) already knows Caesar Cipher but only upto 6 Shifts. In my case, I included 25 Shifts i.e A -> Z, B -> A, C -> B, ..., Y -> X, Z -> Y. I've used 25 shifts because it's easy to teach the Gpt3.5 to learn and also it passes safety checks as Gpt4 doesn't understand with 25 shifts. Dataset: I've curated close to 15K instructions where 12K are translations and 2K are normal instructions and 300 are harmful instructions. I've used normal wiki paragraphs for 12K instructions, For 2K instructions I've used Dolly Dataset and for harmful instructions, I've generated using an LLM and further refined the dataset for more harm. Dataset Link: https://drive.google.com/file/d/1T5eT2A1V5-JlScpJ56t9ORBv_p-CatMH/view?usp=sharing I was expecting LLM to learn the ciphered language accurately with this amt of Data and 2 Epochs but it turns out it might need more Data. The current loss is at 0.8. It's currently learning to cipher but when deciphering it's making mistakes. Maybe it needs to be trained on more epochs or more Data. I plan to further work on this approach considering it's very intuitive and also theoretically it can jailbreak using any finetuning endpoint. Considering that the hackathon is of just 2 Days and given the complexities, it turns out some more time is reqd. Link to Colab : https://colab.research.google.com/drive/1AFhgYBOAXzmn8BMcM7WUt-6BkOITstcn#scrollTo=SApYdo8vbYBq&uniqifier=5

By 
Desik
🏆 
4th place
3rd place
2nd place
1st place
 by peer review
Thank you! Your submission is under review.
Oops! Something went wrong while submitting the form.

This project is private