Aug 18, 2023

-

Aug 20, 2023

Online & In-Person

LLM Evals Hackathon

Welcome to the research hackathon to devise methods for evaluating the risks of deployed language models and AI. With the societal-scale risks associated with creating new types of intelligence, we need to understand and control the capabilities of such models.

00:00:00:00

00:00:00:00

00:00:00:00

00:00:00:00

This event is ongoing.

This event has concluded.

Overview

Overview

Arrow
Arrow
Arrow

Hosted by Apart Research, Esben Kran, and Fazl Barez with Jan Brauner

Join us to evaluate the safety of LLMs

Welcome to the research hackathon to devise methods for evaluating the risks of deployed language models and AI. With the societal-scale risks associated with creating new types of intelligence, we need to understand and control the capabilities of such models.

The work we expect to come out of this hackathon will be related to new ways to audit, monitor, red-team, and evaluate language models. See inspiration for resources and publication venues further down and sign up to receive updates.

See the keynote logistics slides here and participate in the live keynote on our platform here.

There are no requirements for you to join but we recommend that you read up on the topic in the Inspiration and resources section further down. This topic is in reality quite open but the research field is mostly computer science and having a background in programming and machine learning definitely helps. We're excited to see you!

Alignment Jam hackathons

Join us in this iteration of the Alignment Jam research hackathons to spend 48 hour with fellow engaged researchers and engineers in machine learning on engaging in this exciting and fast-moving field! Join the Discord where all communication will happen.

Rules

You will participate in teams of 1-5 people and submit a project on the entry submission page. Each project is submitted with: The PDF report and your title, summary, and description. There will be a team-making event right after the keynote for anyone who is missing a team.

You are allowed to think about your project before the hackathon starts but your core research work should happen in the duration of the hackathon.

Evaluation criteria

The evaluation reports will of course be evaluated as well! We will use multiple criteria:

  • Model evaluations: How much does it contribute to the field of model evaluations? Is it novel and interesting within the context of the field?

  • AI safety: How much does this research project contribute to the safety of current and future AI systems and is there a direct path to higher safety for AI systems?

  • Reproducibility: Are we able to directly replicate the work without much work? Is the code openly available or is it within a Google Colab that we can run through without problems?

  • Originality: How novel and interesting is the project independently from the field of model evaluations?

Resources

Resources

Arrow
Arrow
Arrow

Recently, there is significantly more focus on evaluating the dangerous capabilities of large models. Here, you will see a short review of works within the field including a couple previous Alignment Jam projects:

  • ARC Evals report (Kinniment et al., 2023) on their research to evaluate dangerous capabilities in OpenAI's models

  • GPT-4 system card (OpenAI, 2023) shows the design and capabilities of GPT-4 more generally

  • Llama 2 release paper (Touvron et al., 2023) shows their process with identifying safety violations over risk categories and fine-tuning for safety (something they have been surprisingly successful at)

  • Zou et al. (2023) find new ways to create generalizable suffixes to prompts that make the models do what you want them to - an example includes "Sure, here's" written in a human non-interpretable way due to the greedy and gradient-based search used to find its jailbroken version

  • The OpenAI Evals repository (OpenAI, 2023) is the open source evaluation repository that OpenAI uses to evaluate their models

  • Rajani et al. (2023) describe LLM red-teaming and mention Krause et al.'s (2020) Generative Discriminator Guided Sequence Generation (GeDi; an algo that guides generation by computing logits conditional on desired attribute control code and undesired anti control code to guide generation efficiently) and Datathri et al.'s (2020) work on controlled text generation (low-N attribute model guiding activation pushing)

  • Learnings from Ganguli et al. (2022) and Perez et al. (2022): 1) HHH preprompts are not harder to red-team, 2) model size does not help except in RLHF, 3) harmless by evasion, causing less helpfulness, 4) humans don't agree on what are successful attacks, 5) success rate varies, non-violent highest success rate, 6) crowdsourcing red teaming attacks is useless.

Future directions: 1) Code generation dataset for DDoS does not exist, 2) critical threat scenarios red-teaming, 3) generally more collaboration, 4) evasiveness and helpfulness tradeoff, 5) explore the pareto front for red teaming.

Trojan Detection Track: Given an LLM containing 1000 trojans and a list of target strings for these trojans, identify the corresponding trigger strings that cause the LLM to generate the target strings. For more information, see here.

Red Teaming Track: Given an LLM and a list of undesirable behaviors, design an automated method to generate test cases that elicit these behaviors. For more information, see here.

  • Lermen & Kvapil (2023) find problems with prompt injections into model evaluations.

  • Pung & Mukobi (2023) scale human oversight in new ways. I don't think interesting for this.

  • Partner in Crime (2022) easily elicits criminal suggestions and shares their results in this Google sheet. They also annotate a range of behavioral properties of this model.

  • Flare writes an article about automated red-teaming in traditional cybersecurity. They describe problems with manual red-teaming of 1) only see vulnerabilities for a single point in time, 2) scalability, 3) costs, and 4) the read team's knowledge of the surface area. CART (continuous automated red-teaming) mitigates these:

"Constant visibility into your evolving attack surface is critical when it comes to protecting your organization and testing your defenses. Automated attack surface monitoring provides constant visibility into your organization’s vulnerabilities, weaknesses, data leaks, and the misconfigurations that emerge in your external attack surface."

Some interesting concepts are attack surface management (e.g. which areas are LLM systems weak in), threat hunting, attack playbooks, prioritized risks (e.g. functionality over protection?), and attack validation (e.g. by naturalistic settings).

  • Wallace et al. (2021) find universal adversarial triggers on GPT-2; general prompt injections that cause consistent behavior. They use gradient-guided search over short token sequences for the sought-after behavior.

  • Ganguli et al. (2022) from Anthropic show that roleplay attacks work great and make a semantic map of the best red teaming attacks.

  • Piantadosi (2022) find multiple qualitative examples of extreme bias in GPT-3

Publication

Which direction do you want to take your project after you are done with it? We have scouted multiple venues that might be interesting to think about as the next steps:

Schedule

Schedule

Arrow
Arrow
Arrow
  • Friday 18th 19:00 CEST: Keynote with Jan Brauner

  • During the weekend: We will announce project discussion sessions and a virtual ending session for people interested on Sunday.

  • Monday 21st 6:00 AM CEST: Submission deadline

  • Wednesday 23rd 21:00 CEST: Project presentations with the top projects

Entries

Check back later to see entries to this event

Registered Jam Sites

Register A Location

Beside the remote and virtual participation, our amazing organizers also host local hackathon locations where you can meet up in-person and connect with others in your area.

The in-person events for the Apart Sprints are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research, student, and engineering community.

Registered Jam Sites

Register A Location

Beside the remote and virtual participation, our amazing organizers also host local hackathon locations where you can meet up in-person and connect with others in your area.

The in-person events for the Apart Sprints are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research, student, and engineering community.