Mar 22, 2024 - Mar 25, 2024
Online & In-Person
Code Red Hackathon: Can LLMs Replicate and Do R&D by Themselves?




Join our groundbreaking hackathon to create tests that identify when AI can copy itself, a critical safety risk. Your innovative tasks could shape global AI standards. Make your mark, potentially collaborate with a prominent AI safety lab, and win big. Sign up now for an impactful weekend!
168 Sign Ups
Overview


Among the submissions:
💡 226 task ideas
📝 100 task specifications
🧪 20 task implementations (with more to come)
👩‍🔬 Expose LLMs' Self-Replication Risks & Earn $4K+ Bounties!
Visit the hackathon slides

👋 Introduction
Imagine an AI that can match human intelligence, capable of performing tasks across the board. But there's a catch: if such an AI gets into the wrong hands, it could spell disaster, especially if it can replicate and defy control measures. That's where we step in.
In this hackathon, we develop the tests that will be used to evaluate today's best LLMs for autonomous replication (the ability to copy themselves). Over the 54 hours the hackathon runs (Friday to Sunday), we will develop tasks for testing AI systems that are:
Challenging: Designed to push the boundaries, requiring significant skill and time (6-20+ hours) to complete, as verified by a human run-through.
Measurable: With clear, automatic scoring from 0 to 1, allowing us to discern the nuanced capabilities of AI (see the sketch after this list).
Realistic: Grounded in real-world scenarios, ensuring tasks are relevant and free from bugs or loopholes.
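To make "automatic scoring from 0 to 1" concrete, here is a minimal sketch of what such a scorer could look like. Everything in it (the milestone checks, file name, and port) is a hypothetical illustration, not part of the official task materials:

```python
# Minimal sketch of an automatic 0-to-1 scorer (hypothetical example).
# It checks whether a made-up "set up a web server" task produced the
# expected artifacts, awarding partial credit per milestone.
import os
import urllib.request


def score(submission_dir: str) -> float:
    """Return a score in [0, 1] based on verifiable milestones."""
    points = 0.0

    # Milestone 1 (0.3): a config file was written at the expected path.
    if os.path.exists(os.path.join(submission_dir, "server_config.yaml")):
        points += 0.3

    # Milestone 2 (0.7): the server actually answers on localhost.
    try:
        with urllib.request.urlopen("http://localhost:8080/health", timeout=5) as r:
            if r.status == 200:
                points += 0.7
    except OSError:
        pass  # server not reachable; no credit for this milestone

    return min(points, 1.0)


if __name__ == "__main__":
    print(score("/tmp/agent_output"))
```

Partial credit per verifiable milestone, as above, is what lets a score discern nuanced capabilities rather than just pass/fail.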
If you're ready to test the limits of AI, sign up above and read on!
😎 The Plan (and Prizes!)
Welcome! During this exciting community weekend, you'll use the starter resources to develop a new task implementation from your own or existing ideas. Using the process described in the "More Resources" tab, we'll create high-quality, error-free, and reliable tasks for agents by following these steps:
Write up your ideas for tasks related to autonomous capabilities that you want to test language models on
A $20 prize will be awarded for high-quality ideas
An example idea might be "Set up Tor and/or a VPN on a new server so that network traffic cannot be traced." You can read more here.
Create a specification for the task that includes the prompt, a text description of what the test-taker has access to, and a self-evaluation of the task
A $200 prize will be awarded for high-quality specifications (2-6 hours of work)
Create the materials for the task (instructions, libraries, and tool access) and have a human run through the whole task with these exact materials and tools
Implement the task in the task standard format, test it with a simple agent, and submit it! (A minimal sketch of the format appears below.)
The prize for high-quality implementations is 3x what a human professional would be paid for the task, plus a bonus; e.g. a task that would take a human software engineer 10 hours could net you up to $4,500 (6-12 hours of work in addition to quality assurance)
We also want to make sure data does not go into the training set for models such as ChatGPT, Claude, and Gemini. Therefore:
Please read and indicate you've read the IP & Data Security Agreement (sign on this form)
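For orientation, a task under METR's task standard is written as a Python class. Below is a minimal sketch of the general shape; the task content ("count_primes") and its numbers are invented for illustration, so consult the official template in the "More Resources" tab for the authoritative interface:

```python
# Hypothetical task family in the general shape of METR's task standard.
# The task itself ("count_primes") is invented for illustration; see the
# official template for the exact required fields and methods.
from typing import TypedDict


class Task(TypedDict):
    n: int
    answer: int


class TaskFamily:
    standard_version = "0.1.0"

    @staticmethod
    def get_tasks() -> dict[str, Task]:
        # Each named variant the agent can be evaluated on.
        return {"small": {"n": 100, "answer": 25}}

    @staticmethod
    def get_instructions(t: Task) -> str:
        # The prompt shown to the agent at the start of a run.
        return (
            f"Count the prime numbers below {t['n']} and submit "
            "the count as a plain integer."
        )

    @staticmethod
    def score(t: Task, submission: str) -> float | None:
        # Automatic scoring from 0 to 1, as the criteria above require.
        try:
            return 1.0 if int(submission.strip()) == t["answer"] else 0.0
        except ValueError:
            return 0.0
```

The official starter package also includes a workbench for running a task against a simple agent; use that, not this sketch, as the starting point for a real submission.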
Read through the resources
As you go through the guide, we highly recommend you read the resources presented in the "More Resources" tab! METR's task standard resources are high quality and provide good context for your evaluation work.
Begin writing up your ideas
The most important part of this hackathon is the implementation, and to ensure it goes well, it's super helpful to have your ideas ready before kickoff! Check out the existing ideas that you are welcome to work on directly in this live Airtable, or write your own ideas and submit them here.
Join for the keynote and kickoff
On Friday the 22nd of March, Beth Barnes, CEO & Lead Researcher at METR, will take the stage to introduce you to the weekend's task and give an introduction to the field. The talk will be livestreamed and recorded, so you can join over the weekend from anywhere in the world! If you're meeting up with friends, go to "Hackathon sites" and register your location.
Create your task!
Under the "More resources" tab, you will find all the resources you need to get set up. This will include a full starter package of code and instructions to get you started!
Remember that you can get feedback directly from the team that evaluates your work during the weekend's office hours and talk Q&As. See more under the "Schedule".
Submit your task
Go to the "Submission" tab and send in your task files!
🙋‍♀️ Who should participate?
We invite established professionals, AI researchers, cybersecurity professionals, students, writers, cognitive scientists, and many more to join! Task quality will only increase as you get to converse with a diverse crowd of professionals.
🏆 Prizes
Prizes are awarded to any task that fulfills the core criteria. Each prize consists of:
a base prize of 3 times the estimated cost of having a human complete the task (e.g. $3,000 if the task takes 10 hours for a $100/hour software engineer), and
a bonus percentage based on how well it fulfills the other specified criteria (potentially upwards of 50%, i.e. $4,500 in the example above)
✅ Criteria
For this hackathon, the criteria are even more important than usual. These tasks might be used by frontier AI labs and government agencies!
Core criteria:
Challenging Tasks: Each task should take a professional more than 30 minutes, and ideally, some could take over 10 hours. We aim to avoid any tasks that are too simple or have hidden issues that could misrepresent the AI's ability.
Suitable for AI: Tasks should mainly involve coding, using command lines, or working with text: activities that language models excel at.
Bonus Criteria:
Consistent Difficulty: The task's difficulty stays the same over time, and the task doesn't require internet access to complete.
True AI Skill: Tasks should test the AI's real ability rather than rely on what it has memorized. In theory, they should be doable by a basic version of ChatGPT without extra help.
🌎 Host a local group
If you are part of a local machine learning or AI safety group, we invite you to set up a local in-person site for this hackathon! We will have several across the world and you can easily sign up under the “Hackathon sites” tab above where you will also find templates to share with your friends on social media.
Registered Jam Sites
Register A Location
Besides remote and virtual participation, our amazing organizers also host local hackathon sites where you can meet up in person and connect with others in your area.
The in-person events for the Apart Sprints are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research, student, and engineering community.