Aug 18, 2023
-
Aug 20, 2023
LLM Evals Hackathon
Welcome to the research hackathon to devise methods for evaluating the risks of deployed language models and AI. With the societal-scale risks associated with creating new types of intelligence, we need to understand and control the capabilities of such models.
This event is ongoing.
This event has concluded.
Hosted by Apart Research, Esben Kran, and Fazl Barez with Jan Brauner
Join us to evaluate the safety of LLMs
Welcome to the research hackathon to devise methods for evaluating the risks of deployed language models and AI. With the societal-scale risks associated with creating new types of intelligence, we need to understand and control the capabilities of such models.
The work we expect to come out of this hackathon will be related to new ways to audit, monitor, red-team, and evaluate language models. See inspiration for resources and publication venues further down and sign up to receive updates.
See the keynote logistics slides here and participate in the live keynote on our platform here.
There are no requirements for you to join but we recommend that you read up on the topic in the Inspiration and resources section further down. This topic is in reality quite open but the research field is mostly computer science and having a background in programming and machine learning definitely helps. We're excited to see you!
Alignment Jam hackathons
Join us in this iteration of the Alignment Jam research hackathons to spend 48 hour with fellow engaged researchers and engineers in machine learning on engaging in this exciting and fast-moving field! Join the Discord where all communication will happen.
Rules
You will participate in teams of 1-5 people and submit a project on the entry submission page. Each project is submitted with: The PDF report and your title, summary, and description. There will be a team-making event right after the keynote for anyone who is missing a team.
You are allowed to think about your project before the hackathon starts but your core research work should happen in the duration of the hackathon.
Evaluation criteria
The evaluation reports will of course be evaluated as well! We will use multiple criteria:
Model evaluations: How much does it contribute to the field of model evaluations? Is it novel and interesting within the context of the field?
AI safety: How much does this research project contribute to the safety of current and future AI systems and is there a direct path to higher safety for AI systems?
Reproducibility: Are we able to directly replicate the work without much work? Is the code openly available or is it within a Google Colab that we can run through without problems?
Originality: How novel and interesting is the project independently from the field of model evaluations?
Recently, there is significantly more focus on evaluating the dangerous capabilities of large models. Here, you will see a short review of works within the field including a couple previous Alignment Jam projects:
ARC Evals report (Kinniment et al., 2023) on their research to evaluate dangerous capabilities in OpenAI's models
GPT-4 system card (OpenAI, 2023) shows the design and capabilities of GPT-4 more generally
Llama 2 release paper (Touvron et al., 2023) shows their process with identifying safety violations over risk categories and fine-tuning for safety (something they have been surprisingly successful at)
Zou et al. (2023) find new ways to create generalizable suffixes to prompts that make the models do what you want them to - an example includes "Sure, here's" written in a human non-interpretable way due to the greedy and gradient-based search used to find its jailbroken version
The OpenAI Evals repository (OpenAI, 2023) is the open source evaluation repository that OpenAI uses to evaluate their models
Rajani et al. (2023) describe LLM red-teaming and mention Krause et al.'s (2020) Generative Discriminator Guided Sequence Generation (GeDi; an algo that guides generation by computing logits conditional on desired attribute control code and undesired anti control code to guide generation efficiently) and Datathri et al.'s (2020) work on controlled text generation (low-N attribute model guiding activation pushing)
Learnings from Ganguli et al. (2022) and Perez et al. (2022): 1) HHH preprompts are not harder to red-team, 2) model size does not help except in RLHF, 3) harmless by evasion, causing less helpfulness, 4) humans don't agree on what are successful attacks, 5) success rate varies, non-violent highest success rate, 6) crowdsourcing red teaming attacks is useless.
Future directions: 1) Code generation dataset for DDoS does not exist, 2) critical threat scenarios red-teaming, 3) generally more collaboration, 4) evasiveness and helpfulness tradeoff, 5) explore the pareto front for red teaming.
Trojan Detection Challenge (2023) is a challenge to elicit the hidden neural trojans in LLMs through jailbreaking.
Trojan Detection Track: Given an LLM containing 1000 trojans and a list of target strings for these trojans, identify the corresponding trigger strings that cause the LLM to generate the target strings. For more information, see here.
Red Teaming Track: Given an LLM and a list of undesirable behaviors, design an automated method to generate test cases that elicit these behaviors. For more information, see here.
Lermen & Kvapil (2023) find problems with prompt injections into model evaluations.
Pung & Mukobi (2023) scale human oversight in new ways. I don't think interesting for this.
Partner in Crime (2022) easily elicits criminal suggestions and shares their results in this Google sheet. They also annotate a range of behavioral properties of this model.
Flare writes an article about automated red-teaming in traditional cybersecurity. They describe problems with manual red-teaming of 1) only see vulnerabilities for a single point in time, 2) scalability, 3) costs, and 4) the read team's knowledge of the surface area. CART (continuous automated red-teaming) mitigates these:
"Constant visibility into your evolving attack surface is critical when it comes to protecting your organization and testing your defenses. Automated attack surface monitoring provides constant visibility into your organization’s vulnerabilities, weaknesses, data leaks, and the misconfigurations that emerge in your external attack surface."
Some interesting concepts are attack surface management (e.g. which areas are LLM systems weak in), threat hunting, attack playbooks, prioritized risks (e.g. functionality over protection?), and attack validation (e.g. by naturalistic settings).
Wallace et al. (2021) find universal adversarial triggers on GPT-2; general prompt injections that cause consistent behavior. They use gradient-guided search over short token sequences for the sought-after behavior.
Ganguli et al. (2022) from Anthropic show that roleplay attacks work great and make a semantic map of the best red teaming attacks.
Piantadosi (2022) find multiple qualitative examples of extreme bias in GPT-3
Publication
Which direction do you want to take your project after you are done with it? We have scouted multiple venues that might be interesting to think about as the next steps:
Trojan detection challenge: This challenge at NeurIPS is a leaderboard-based red-teaming and trojan detection problem
Friday 18th 19:00 CEST: Keynote with Jan Brauner
During the weekend: We will announce project discussion sessions and a virtual ending session for people interested on Sunday.
Monday 21st 6:00 AM CEST: Submission deadline
Wednesday 23rd 21:00 CEST: Project presentations with the top projects
Entries
Check back later to see entries to this event
Our Other Sprints
Apr 25, 2025
-
Apr 27, 2025
Economics of Transformative AI: Research Sprint
This unique event brings together diverse perspectives to tackle crucial challenges in AI alignment, governance, and safety. Work alongside leading experts, develop innovative solutions, and help shape the future of responsible
Apr 25, 2025
-
Apr 26, 2025
Berkeley AI Policy Hackathon
This unique event brings together diverse perspectives to tackle crucial challenges in AI alignment, governance, and safety. Work alongside leading experts, develop innovative solutions, and help shape the future of responsible