Organized by the AI safety initiatives at Georgia Tech and and the University of Michigan.
The development of artificial general intelligence (AGI) could be one of humanity's greatest achievements – or one of its greatest challenges. With only one AI safety researcher for every 250 working on advancing AI capabilities, we urgently need more talented minds focused on ensuring AI systems remain safe and aligned with human values.
The Deadus AI Safety Hackathon invites high school and undergraduate students to tackle crucial challenges in AI safety alongside leading researchers and practitioners. Whether you're interested in technical alignment or AI governance, this is your opportunity to make a real impact on one of the most pressing issues of our time.
Beginner (High School)
Advanced (Undergraduate)
Beginner (High School)
Advanced (Undergraduate)
Ready to help shape the future of AI safety? Sign up now and join a community of forward-thinking individuals working to ensure artificial intelligence remains beneficial for humanity.
The goal of core readings is to explain what AI safety & alignment is, and why we care about it. This should cover fundamental arguments for and against AI safety and the current AI landscape, and not very technical. These readings will also help participants decide whether or not they’re interested in the topic.
AI safety in the news / expert opinions, optional
This field develops rigorous methods to assess AI models across multiple dimensions - from testing their reasoning abilities and checking for subtle biases, to evaluating their honesty and probing for potentially deceptive behaviors. A major focus is detecting signs of misalignment: Is the model optimizing for what we truly want, or has it learned to optimize for something else? Researchers create sophisticated test scenarios and adversarial challenges to expose potential weaknesses, failure modes, or optimization goals that might not be apparent in standard testing.
A key emphasis is evaluating models' alignment properties: Do they maintain goal-directed behavior that matches human intent? Can they reliably recognize and acknowledge their own limitations rather than making overconfident mistakes? How do they handle ambiguous or ethically challenging scenarios where simple reward functions might lead to unintended consequences? By systematically investigating these questions, model evaluation helps ensure AI systems are not just powerful, but also reliably aligned with human values and objectives. This includes testing for subtle forms of deception, reward hacking, and goal preservation under distribution shift - all critical challenges for building trustworthy AI.
Readings
-
- this article goes over a bunch of common problems encountered when thinking about alignment and how to judge whether a model is aligned— very digestible and beginner-level reading for introduction
-
by Apollo research
- Don’t need to read “current work in….” and “field building” sections, though the former might give ideas!
-
Resources
-
- From Apollo. This is a great intro resource.
-
- (more advanced) Apollo’s list of evals papers to be familiar with. This isn’t necessarily “core reading”, and is heavily opinionated based on Apollos research agenda, but most of the “Our Favorite Papers” items are worth being familiar with.
- More advanced technical guide with code: ARENA week 3:
Ideas/example projects
(A good rule-of-thumb is that most questions about how language models behave are probably under-explored)
-
- Evals for
(Owain Evans)
-
Mechanistic interpretability researchers aim to reverse engineer AI models, examining their internal components and operations - much like neuroscientists studying the human brain. Rather than treating AI systems as black boxes, they develop methods to map out how these systems process information and reach conclusions.
This work is crucial for AI safety - understanding how models work enables us to make them more reliable and controllable. When we know exactly how an AI system makes decisions, we can better predict failures, prevent unwanted behaviors, and ensure alignment with human values.
The field blends computer science, neuroscience, mathematics, and philosophy to understand how neural networks recognize patterns, what internal representations they build, and how distinct computational circuits within them work together.
Readings
-
- I think this article does the best job explaining some key terms (polysemantic neurons, SAEs, superposition) naturally
-
- this article establishes why interpretability is important and then goes into some examples and ways of thinking about how to interpret specific kinds of models. (also has a glossary w/ important definitions beginners might not know at the end of the article)
-
This article introduces the problem of interpretability and current work in an easy to understand way. It then explains new research interpreting the activation of features when specific words are prompted to an LLM.
Resources
- Pop-sci intro:
-
-
- Neel Nanda:
- Neel Nanda:
- Neel Nanda:
- More advanced:
-
- good in depth video about why interpretability is important and difficulties in making models interpretable as they get smarter (scalable oversight)
Ideas/example projects
- Making a CNN, training it, and demonstrating the application of different interpretability libraries like GRADCam.
Adversarial robustness explores a fascinating vulnerability at the heart of modern AI systems: their susceptibility to carefully crafted inputs designed to fool them. While an AI might correctly identify thousands of normal images of cats, a few carefully placed pixels can trick it into seeing a toaster instead - a weakness that humans typically don't share.
This field studies how to make AI systems resilient against these "adversarial attacks" - minimal changes to inputs that cause maximal confusion in model outputs. The stakes extend far beyond academic curiosity. As AI systems are deployed in critical applications like medical diagnosis or autonomous vehicles, ensuring they remain reliable even when faced with manipulated inputs becomes crucial for safety.
Readings
-
- explains adversarial attacks and why it is a core problem in AI security. Gives high-level examples of adversarial attacks as well as defenses.
-
- less general, more advanced, though more current. Probably not worth reading unless interested. This got some buzz a few months ago.
-
Short reading about what red-teaming is about in general tech fields. Can give context to what it is in AI
Resources
-
Readings
-
- From JHU Policy Hackathon, good overview on the space so far.
-
- Goes into more specifics on what could be regulated
-
– Delineates the UN’s first resolution on AI which was signed by 120 states including China
-
– overview of governance risks
Resources
Ideas/example projects
Readings
-
(only pages 272-279) Introduces theoretical frameworks to the “rules” that AI should follow.
-
quick introduction to AI regulation strategies
-
Covers basic needs for AI regulation and then does a basic review of several possible regulation strategies
-
introduction to the idea of compute governance and why it’s important
-
Good place to get some ideas
-
– overview of the EU AI act; useful to read source text.
-
Resources
Ideas/example projects
Readings
-
(fairness reading, not data privacy, but idk where else to put) (pages 273-289)
-
Explains a lawsuit that got filed against GitHub after they allegedly trained CoPilot on user code without permission.
-
An app that protects artists against having their art used to train GenAI without their consent.
-
- its sibling program Nightshade actively poisons models that are trained on it, can be a really interesting conversation starter
Resources
-
-
Ideas/example projects
The schedule runs from 4 PM UTC Friday to 3 AM Monday. We start with an introductory talk and end the event during the following week with an awards ceremony. Join the public ICal here.You will also find Explorer events, such as collaborative brainstorming and team match-making before the hackathon begins on Discord and in the calendar.