Our most popular Hackathon ever
Our successful Women in AI Safety hackathon is complete and the winning teams have been notified.
Congratulations to the hundreds of participants who joined us over the weekend from all over the world. This was our most popular hackathon ever.

We had physical Jam Sites in Singapore, England, India, France, and more, with the rest of the participants hacking online.
Winners
All of our 10+ fantastic judges remarked on how impressed they were by the quality of the submissions. Across our Social Sciences, Mechanistic Interpretability, and Education tracks, the submissions were creative and thoughtful, and some had real-world implications for AI safety and security. Now for the winners…
[1] Social Sciences Track: Detecting Malicious AI Agents Through Simulated Interactions, by Yulu Pi, Anna Becker, Ella Bettison

This research investigates the manipulative traits of malicious AI assistants and whether their behaviours can be detected when they interact with human-like simulated users in various decision-making contexts.
The team also examines how interaction depth and planning ability influence these assistants' manipulative strategies and effectiveness. The findings suggest that malicious AI assistants employ domain-specific, persona-tailored manipulation strategies, exploiting simulated users' vulnerabilities and emotional triggers.
These findings underscore critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.
The judges praised the well-designed experimental methodology, which spans diverse decision-making scenarios. They also noted its clear implications for AI safety in real-world deployment contexts.
[2] Mechanistic Interpretability Track: Red-teaming with Mech-Interpretability, by Devina Jain

The judges thought this submission was a really innovative combination of mechanistic interpretability and red teaming.
Red teaming large language models (LLMs) is crucial for identifying vulnerabilities before deployment, yet systematically creating effective adversarial prompts remains challenging. This project introduces a novel approach that leverages mechanistic interpretability to enhance red teaming efficiency.
The team developed a system that analyzes prompt effectiveness using neural activation patterns from the Goodfire API. By scraping 1,034 successful jailbreak attempts from JailbreakBench and combining them with 2,000 benign interactions from UltraChat, they created a balanced dataset of harmful and helpful prompts.
This allowed the team 'to train a 3-layer MLP classifier that identifies "high entropy" prompts', that is, those most likely to elicit unsafe model behaviors.
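For readers curious what such a classifier looks like in practice, here is a minimal sketch, not the team's actual code. It assumes each prompt has already been turned into a fixed-length feature vector; the random synthetic vectors below merely stand in for activation features and involve no Goodfire API calls. Every function name, the layer sizes, the learning rate, the scaled-down class sizes, and the reading of "3-layer" as three weight layers are illustrative assumptions.

```python
import math
import random


def make_synthetic_features(n_harmful, n_benign, dim, rng):
    """Hypothetical stand-in for per-prompt activation features (NOT real data)."""
    data = []
    for _ in range(n_harmful):
        data.append(([rng.gauss(1.0, 1.0) for _ in range(dim)], 1))  # label 1: harmful
    for _ in range(n_benign):
        data.append(([rng.gauss(-1.0, 1.0) for _ in range(dim)], 0))  # label 0: benign
    rng.shuffle(data)
    return data


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layers, x):
    """Return the activations of every layer: tanh hidden layers, sigmoid output."""
    activations = [x]
    for i, (w, b) in enumerate(layers):
        z = [sum(wk * ak for wk, ak in zip(row, activations[-1])) + bj
             for row, bj in zip(w, b)]
        if i == len(layers) - 1:
            activations.append([1.0 / (1.0 + math.exp(-v)) for v in z])
        else:
            activations.append([math.tanh(v) for v in z])
    return activations


def train_step(layers, x, y, lr):
    """One SGD step on binary cross-entropy, using plain backpropagation."""
    acts = forward(layers, x)
    delta = [acts[-1][0] - y]  # gradient w.r.t. the output pre-activation
    for i in reversed(range(len(layers))):
        w, b = layers[i]
        a_prev = acts[i]
        # Gradient w.r.t. the previous activations, taken before updating w.
        grad_prev = [sum(w[j][k] * delta[j] for j in range(len(w)))
                     for k in range(len(a_prev))]
        for j in range(len(w)):
            for k in range(len(a_prev)):
                w[j][k] -= lr * delta[j] * a_prev[k]
            b[j] -= lr * delta[j]
        if i > 0:
            # tanh derivative is 1 - a^2
            delta = [g * (1.0 - a * a) for g, a in zip(grad_prev, a_prev)]


def predict(layers, x):
    """Probability that the prompt behind feature vector x is harmful."""
    return forward(layers, x)[-1][0]


rng = random.Random(0)
data = make_synthetic_features(n_harmful=100, n_benign=200, dim=4, rng=rng)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Three weight layers: 4 -> 8 -> 8 -> 1 (sizes are arbitrary for this sketch).
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
for epoch in range(20):
    for x, y in train:
        train_step(layers, x, y, lr=0.05)

accuracy = sum((predict(layers, x) >= 0.5) == bool(y) for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In a real pipeline, the synthetic vectors would be replaced by features extracted from the model's activations for each prompt, and the classifier's output score could then be used to rank candidate red-teaming prompts.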
[3] Education Track: Morph: AI Safety Education Adaptable to (Almost) Anyone, by Shafira Noh, Wan Aimran

In their own words, this team was motivated by the thesis that 'AI safety education struggles with cultural homogeneity, abstract technical content, and unclear learning and post-learning pathways, alienating global audiences.'
They seek to address these gaps with an integrated platform featuring culturally adaptive content, and the judges were very impressed by the idea. Great work!
Come join us
We look forward to welcoming some of the more promising projects to Apart Research's Studio. As always, stay updated here, and we hope to see you at our next hackathon.