Nov 24, 2023
-
Nov 26, 2023
Online & In-Person
AI Model Evaluations Hackathon




This event has concluded.
48
Sign Ups
Entries
Overview
Resources
Schedule
Entries
Overview

Join the fun 3 day research sprint!
Governments are at a loss as to the risks of AI. Many organizations are interested in understanding where they can deploy AI safely and where they cannot. Come join us for this weekend's effort to uncover the risks and design evaluations for understanding dangerous capabilities of language models!
Watch the keynote live and recorded below
We start the weekend with a livestreamed and recorded keynote talk introducing the topic and introducing the schedule for the weekend. Saturday and Sunday have mentoring sessions (office hours) where we encourage you to show up on the Discord. Wednesday the 29th, we host project presentations where top projects will showcase their results.
Get an introduction to doing evaluations from Apollo Research and see ideas under the "Ideas" tab for what you can spend your weekend on. See more inspiration further down the page along with the judging criteria.
Thank you to our gold sponsor Apollo Research.
There are no requirements for you to join but we recommend that you read up on the topic in the Inspiration and resources section further down.
Alignment Jam hackathons
The Alignment Jam hackathons are research sprints within topics of AI safety and security where you spend a weekend with fellow engaged researchers to dive into exciting and important research problems. Join the Discord where most communication and talks will happen and visit the front page.
How to get started with evaluations research
Read the introductory guide by Apollo Research on how to do evaluations research at the link here. You can both find ideas for your evaluations research in the "Ideas" tab and in their live updated document here.
Methodology in the field of evaluations
We are extra interested in methodologies of model evaluation since it is an important question. A few of the main issues with existing AI model evaluation methods include:
Static benchmarks are becoming saturated, as state-of-the-art models are already performing very well on many standard tests. This makes it hard to distinguish between more and less capable systems.
Benchmarks are often distant from real-world uses, so high performance may not translate to usefulness.
Internal testing by companies can be more realistic, but results are hard to compare across organizations.
Measuring real-world usefulness (e.g. developer productivity) requires carefully designing proxies based on the specifics of the task, which is challenging.
Eliciting capabilities is an art more than a science right now. What works for one model may not work for another.
Fine-tuning models for evaluations introduce questions about how much training to provide and how to accurately evaluate performance on the fine-tuned task.
Testing for model alignment is extremely difficult with little consensus on good approaches so far. Potential methods like red teaming have limitations.
Evaluating oversight methods for more advanced models than we have today remains very challenging. Approaches like sandwiching provide some help but may not fully generalize.
In summary, existing methods tend to be static, simplistic, non-comparable, and frequently disconnected from real uses.
So, how can we advance model evaluation methods? That's what we want to find out with you!
Rules
You will participate in teams of 1-5 people and submit a project on the entry submission tab. Each project is submitted with: The PDF report and your title, summary, and description. There will be a team-making event right after the keynote for anyone who is missing a team.
You are allowed to think about your project before the hackathon starts but your core research work should happen in the duration of the hackathon.
Evaluation criteria
Your evaluations reports will of course be evaluated as well!
Innovative Methodology: Evaluations is hampered by many methods being tried within behavioral and some interpretability domains, however we have yet to see principled methods for evaluation. With inspiration from other fields like experimental physics, cognitive science, and governance, you might be able to invent new methodologies that more accurately captures dangerous capabilities!
Compelling Narrative: Demos should ideally capture important themes, be accessible to researchers, and present a compelling narrative. For example, when researchers were able to prompt GPT-4 to solve a captcha using a hired human (see Section 2.9 of the GPT-4 system card).
Easily Replicable: Demos should ideally be clear, well-documented, and easily replicable. For instance, is the code openly available or is it within a Google Colab that we can run through without problems?
48
Sign Ups
Entries
Overview
Resources
Schedule
Entries
Overview

Join the fun 3 day research sprint!
Governments are at a loss as to the risks of AI. Many organizations are interested in understanding where they can deploy AI safely and where they cannot. Come join us for this weekend's effort to uncover the risks and design evaluations for understanding dangerous capabilities of language models!
Watch the keynote live and recorded below
We start the weekend with a livestreamed and recorded keynote talk introducing the topic and introducing the schedule for the weekend. Saturday and Sunday have mentoring sessions (office hours) where we encourage you to show up on the Discord. Wednesday the 29th, we host project presentations where top projects will showcase their results.
Get an introduction to doing evaluations from Apollo Research and see ideas under the "Ideas" tab for what you can spend your weekend on. See more inspiration further down the page along with the judging criteria.
Thank you to our gold sponsor Apollo Research.
There are no requirements for you to join but we recommend that you read up on the topic in the Inspiration and resources section further down.
Alignment Jam hackathons
The Alignment Jam hackathons are research sprints within topics of AI safety and security where you spend a weekend with fellow engaged researchers to dive into exciting and important research problems. Join the Discord where most communication and talks will happen and visit the front page.
How to get started with evaluations research
Read the introductory guide by Apollo Research on how to do evaluations research at the link here. You can both find ideas for your evaluations research in the "Ideas" tab and in their live updated document here.
Methodology in the field of evaluations
We are extra interested in methodologies of model evaluation since it is an important question. A few of the main issues with existing AI model evaluation methods include:
Static benchmarks are becoming saturated, as state-of-the-art models are already performing very well on many standard tests. This makes it hard to distinguish between more and less capable systems.
Benchmarks are often distant from real-world uses, so high performance may not translate to usefulness.
Internal testing by companies can be more realistic, but results are hard to compare across organizations.
Measuring real-world usefulness (e.g. developer productivity) requires carefully designing proxies based on the specifics of the task, which is challenging.
Eliciting capabilities is an art more than a science right now. What works for one model may not work for another.
Fine-tuning models for evaluations introduce questions about how much training to provide and how to accurately evaluate performance on the fine-tuned task.
Testing for model alignment is extremely difficult with little consensus on good approaches so far. Potential methods like red teaming have limitations.
Evaluating oversight methods for more advanced models than we have today remains very challenging. Approaches like sandwiching provide some help but may not fully generalize.
In summary, existing methods tend to be static, simplistic, non-comparable, and frequently disconnected from real uses.
So, how can we advance model evaluation methods? That's what we want to find out with you!
Rules
You will participate in teams of 1-5 people and submit a project on the entry submission tab. Each project is submitted with: The PDF report and your title, summary, and description. There will be a team-making event right after the keynote for anyone who is missing a team.
You are allowed to think about your project before the hackathon starts but your core research work should happen in the duration of the hackathon.
Evaluation criteria
Your evaluations reports will of course be evaluated as well!
Innovative Methodology: Evaluations is hampered by many methods being tried within behavioral and some interpretability domains, however we have yet to see principled methods for evaluation. With inspiration from other fields like experimental physics, cognitive science, and governance, you might be able to invent new methodologies that more accurately captures dangerous capabilities!
Compelling Narrative: Demos should ideally capture important themes, be accessible to researchers, and present a compelling narrative. For example, when researchers were able to prompt GPT-4 to solve a captcha using a hired human (see Section 2.9 of the GPT-4 system card).
Easily Replicable: Demos should ideally be clear, well-documented, and easily replicable. For instance, is the code openly available or is it within a Google Colab that we can run through without problems?
48
Sign Ups
Entries
Overview
Resources
Schedule
Entries
Overview

Join the fun 3 day research sprint!
Governments are at a loss as to the risks of AI. Many organizations are interested in understanding where they can deploy AI safely and where they cannot. Come join us for this weekend's effort to uncover the risks and design evaluations for understanding dangerous capabilities of language models!
Watch the keynote live and recorded below
We start the weekend with a livestreamed and recorded keynote talk introducing the topic and introducing the schedule for the weekend. Saturday and Sunday have mentoring sessions (office hours) where we encourage you to show up on the Discord. Wednesday the 29th, we host project presentations where top projects will showcase their results.
Get an introduction to doing evaluations from Apollo Research and see ideas under the "Ideas" tab for what you can spend your weekend on. See more inspiration further down the page along with the judging criteria.
Thank you to our gold sponsor Apollo Research.
There are no requirements for you to join but we recommend that you read up on the topic in the Inspiration and resources section further down.
Alignment Jam hackathons
The Alignment Jam hackathons are research sprints within topics of AI safety and security where you spend a weekend with fellow engaged researchers to dive into exciting and important research problems. Join the Discord where most communication and talks will happen and visit the front page.
How to get started with evaluations research
Read the introductory guide by Apollo Research on how to do evaluations research at the link here. You can both find ideas for your evaluations research in the "Ideas" tab and in their live updated document here.
Methodology in the field of evaluations
We are extra interested in methodologies of model evaluation since it is an important question. A few of the main issues with existing AI model evaluation methods include:
Static benchmarks are becoming saturated, as state-of-the-art models are already performing very well on many standard tests. This makes it hard to distinguish between more and less capable systems.
Benchmarks are often distant from real-world uses, so high performance may not translate to usefulness.
Internal testing by companies can be more realistic, but results are hard to compare across organizations.
Measuring real-world usefulness (e.g. developer productivity) requires carefully designing proxies based on the specifics of the task, which is challenging.
Eliciting capabilities is an art more than a science right now. What works for one model may not work for another.
Fine-tuning models for evaluations introduce questions about how much training to provide and how to accurately evaluate performance on the fine-tuned task.
Testing for model alignment is extremely difficult with little consensus on good approaches so far. Potential methods like red teaming have limitations.
Evaluating oversight methods for more advanced models than we have today remains very challenging. Approaches like sandwiching provide some help but may not fully generalize.
In summary, existing methods tend to be static, simplistic, non-comparable, and frequently disconnected from real uses.
So, how can we advance model evaluation methods? That's what we want to find out with you!
Rules
You will participate in teams of 1-5 people and submit a project on the entry submission tab. Each project is submitted with: The PDF report and your title, summary, and description. There will be a team-making event right after the keynote for anyone who is missing a team.
You are allowed to think about your project before the hackathon starts but your core research work should happen in the duration of the hackathon.
Evaluation criteria
Your evaluations reports will of course be evaluated as well!
Innovative Methodology: Evaluations is hampered by many methods being tried within behavioral and some interpretability domains, however we have yet to see principled methods for evaluation. With inspiration from other fields like experimental physics, cognitive science, and governance, you might be able to invent new methodologies that more accurately captures dangerous capabilities!
Compelling Narrative: Demos should ideally capture important themes, be accessible to researchers, and present a compelling narrative. For example, when researchers were able to prompt GPT-4 to solve a captcha using a hired human (see Section 2.9 of the GPT-4 system card).
Easily Replicable: Demos should ideally be clear, well-documented, and easily replicable. For instance, is the code openly available or is it within a Google Colab that we can run through without problems?
48
Sign Ups
Entries
Overview
Resources
Schedule
Entries
Overview

Join the fun 3 day research sprint!
Governments are at a loss as to the risks of AI. Many organizations are interested in understanding where they can deploy AI safely and where they cannot. Come join us for this weekend's effort to uncover the risks and design evaluations for understanding dangerous capabilities of language models!
Watch the keynote live and recorded below
We start the weekend with a livestreamed and recorded keynote talk introducing the topic and introducing the schedule for the weekend. Saturday and Sunday have mentoring sessions (office hours) where we encourage you to show up on the Discord. Wednesday the 29th, we host project presentations where top projects will showcase their results.
Get an introduction to doing evaluations from Apollo Research and see ideas under the "Ideas" tab for what you can spend your weekend on. See more inspiration further down the page along with the judging criteria.
Thank you to our gold sponsor Apollo Research.
There are no requirements for you to join but we recommend that you read up on the topic in the Inspiration and resources section further down.
Alignment Jam hackathons
The Alignment Jam hackathons are research sprints within topics of AI safety and security where you spend a weekend with fellow engaged researchers to dive into exciting and important research problems. Join the Discord where most communication and talks will happen and visit the front page.
How to get started with evaluations research
Read the introductory guide by Apollo Research on how to do evaluations research at the link here. You can both find ideas for your evaluations research in the "Ideas" tab and in their live updated document here.
Methodology in the field of evaluations
We are extra interested in methodologies of model evaluation since it is an important question. A few of the main issues with existing AI model evaluation methods include:
Static benchmarks are becoming saturated, as state-of-the-art models are already performing very well on many standard tests. This makes it hard to distinguish between more and less capable systems.
Benchmarks are often distant from real-world uses, so high performance may not translate to usefulness.
Internal testing by companies can be more realistic, but results are hard to compare across organizations.
Measuring real-world usefulness (e.g. developer productivity) requires carefully designing proxies based on the specifics of the task, which is challenging.
Eliciting capabilities is an art more than a science right now. What works for one model may not work for another.
Fine-tuning models for evaluations introduce questions about how much training to provide and how to accurately evaluate performance on the fine-tuned task.
Testing for model alignment is extremely difficult with little consensus on good approaches so far. Potential methods like red teaming have limitations.
Evaluating oversight methods for more advanced models than we have today remains very challenging. Approaches like sandwiching provide some help but may not fully generalize.
In summary, existing methods tend to be static, simplistic, non-comparable, and frequently disconnected from real uses.
So, how can we advance model evaluation methods? That's what we want to find out with you!
Rules
You will participate in teams of 1-5 people and submit a project on the entry submission tab. Each project is submitted with: The PDF report and your title, summary, and description. There will be a team-making event right after the keynote for anyone who is missing a team.
You are allowed to think about your project before the hackathon starts but your core research work should happen in the duration of the hackathon.
Evaluation criteria
Your evaluations reports will of course be evaluated as well!
Innovative Methodology: Evaluations is hampered by many methods being tried within behavioral and some interpretability domains, however we have yet to see principled methods for evaluation. With inspiration from other fields like experimental physics, cognitive science, and governance, you might be able to invent new methodologies that more accurately captures dangerous capabilities!
Compelling Narrative: Demos should ideally capture important themes, be accessible to researchers, and present a compelling narrative. For example, when researchers were able to prompt GPT-4 to solve a captcha using a hired human (see Section 2.9 of the GPT-4 system card).
Easily Replicable: Demos should ideally be clear, well-documented, and easily replicable. For instance, is the code openly available or is it within a Google Colab that we can run through without problems?
Registered Jam Sites
Register A Location
Beside the remote and virtual participation, our amazing organizers also host local hackathon locations where you can meet up in-person and connect with others in your area.
The in-person events for the Apart Sprints are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research, student, and engineering community.
Registered Jam Sites
Register A Location
Beside the remote and virtual participation, our amazing organizers also host local hackathon locations where you can meet up in-person and connect with others in your area.
The in-person events for the Apart Sprints are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research, student, and engineering community.
Our Other Sprints
May 30, 2025
-
Jun 1, 2025
Research
Apart x Martian Mechanistic Interpretability Hackathon
This unique event brings together diverse perspectives to tackle crucial challenges in AI alignment, governance, and safety. Work alongside leading experts, develop innovative solutions, and help shape the future of responsible
Sign Up
Sign Up
Sign Up
Apr 25, 2025
-
Apr 27, 2025
Research
Economics of Transformative AI
This unique event brings together diverse perspectives to tackle crucial challenges in AI alignment, governance, and safety. Work alongside leading experts, develop innovative solutions, and help shape the future of responsible
Sign Up
Sign Up
Sign Up

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events

Sign up to stay updated on the
latest news, research, and events