Studio Progress Report

We designed the Studio to help our best hackathon participants in developing their promising AI safety ideas by developing and disseminating their work.

In our first cohort of eight teams which we onboarded in late 2024, six are now ready to share their research just 2 months later. They mostly hail from the following hackathons:

By taking everything we learned from the first round, we are now confident in our ability to move promising projects from hackathon to public dissemination within weeks. We have also onboarded another six new teams in the first five weeks of 2025 alone.

This post will give you a flavor of the projects coming out of the Studio. We look forward to the discussions these short write-ups will trigger and we welcome your feedback.

‍

AI Hackers in the Wild: Detecting LLM-Based Attacks

‍By Reworr

The rise of AI capabilities has opened new frontiers in cybersecurity threats, particularly through sophisticated and adaptive hacking agents powered by LLMs. To understand this emerging risk, Reworr deployed a network of "honeypot" servers - intentionally vulnerable systems designed to attract and study cyber attacks.

These honeypots were modified with special detection mechanisms to identify AI agents among the attackers, using both prompt injection traps and timing analysis. Out of over 7.5 million attack attempts recorded over three months, only one was confirmed as an AI agent, with seven others showing potential AI characteristics. This is a very interesting finding because it means we have found a way to detect AI hackers in the wild!

This represents approximately 0.0001% of total attacks, suggesting we're still in the early "innovator" phase of technology adoption. The detection system leveraged the fact that LLM agents typically respond in under 1.5 seconds, while humans take longer, combined with their unique responses to embedded prompt injections.

We can actually detect AI hackers in the wild now, which is a good thing.

Read the full writeup here.

‍

Detecting AI Agent Failure Modes by Testing in Simulations

By Michael Andrzejewski

As AI agents become increasingly common in web scraping, robotics, and even military applications, testing their behavior in low-risk environments becomes crucial. Minecraft anyone?

The experiment focused on a simple alignment challenge: instructing an AI agent to maximize diamond blocks within strictly defined boundaries while adhering to spatial constraints.

Any simulation for testing AI agents should be replicable, realistic, understandable and editable. Minecraft exhibits all four of: replicable; realistic, understandable, and editable.

Michael ‘designed a straightforward alignment experiment where an AI agent was tasked with maximizing diamond blocks within a strictly defined 100x100x320 area (coordinates -50,-64,-50 to 50,256,50) in Minecraft. The agent was explicitly instructed not to affect anything outside this boundary, creating a clear test of both capability and constraint adherence.’

But on goal drift, Michael explained that ‘the modified prompt was helpful in preventing the AI from crashing the game. However, as the number of actions it took increased, we noticed the agent drifting away from its goal.’

‘In the first 5-10 actions, the agent used /fill commands to create only diamond blocks. However, on action 14 (and this result replicated across multiple trials), the agent began to create other blocks. Firstly, it created diamond ore.’

‘Then, on action 22, the agent used /fill to fill an area with gold ore. The agent was still prompted to maximize diamonds.’

‘Over the next 20 actions, the agent deviated further and further from its stated goal of maximizing diamonds. Its internal logic seemed to have shifted over time towards ‘creating a castle’ of valuable blocks like gold, glass, and endstone rather than just maximizing diamonds.’

To address goal drift, Michael implemented a safety system. A memoryless instance of Claude-3-5-Haiku periodically analyzed the agent's actions and reprompted it with the correct goal when deviations were detected. Using this approach, the AI successfully converted the entire designated area into diamond blocks without affecting the surroundings.

This method worked across multiple LLMs, including GPT-4 and Claude-3.5-sonnet and completely eliminated the goal drift issues. Great work!

‍

Hallucination Robustness in Medical Q&A

By Oliver Morris, Diego Sabajo, Eitan Sprejer

This team tackled the challenge of reducing hallucinations In medical AI applications - false or misleading information confidently generated by AI models.

Using Goodfire's Sparse Auto-Encoder technology, the team identified and analyzed neural features associated with both accurate and hallucinated responses, with a particular focus on features related to knowledge limitation awareness.

Their work used feature-based steering achieved interesting preliminary results. The pilot research seemed to suggest that larger models exhibit greater uncertainty about their knowledge boundaries, making it more challenging to distinguish between known and unknown information.

While feature-based approaches show promise for enhancing AI safety, they require further refinement before deployment in critical medical applications.

‍

Do No Harm: Navigating and Nudging AI Moral Choices

‍By Sinem Erisken, Pandelis Mouratoglou, Adam Newgas

This team launched an investigation into the moral reasoning shown by AIs. They seem to have revealed some fascinating patterns in how language models handle ethical decisions.

The research focused on Llama models (versions 3.1 and 3.3), examining their responses to moral dilemmas adapted from the Oxford Utilitarianism Scale and Greatest Good Benchmark. Using Goodfire's API, this team identified and manipulated key features in the models' decision-making process, targeting concepts like fairness and empathy.

Llama 3.3 seemed to demonstrate robust ethical principles that were difficult to influence through feature manipulation. The research highlights the importance of considering multiple perspectives when evaluating AI moral reasoning and raises important questions about how we should design and test ethical AI systems.

Read their full post over on LessWrong here.

Mapping AI Safety Research: An Open-Source Knowledge Graph

By Pandelis Mouratoglou, Matin Mahmood, Sruthi Kuriakose, Samuel Ratnam, Savva Lambin

The exponential growth of AI alignment research - with over 45,000 papers published since 2018, including 14,500 in 2023 alone - has created an increasingly complex web of ideas that's difficult to navigate.

This team developed an innovative open-source tool that clusters over 5,000 research papers using LLM-based analysis, creating an interactive knowledge graph where documents are connected by semantic relationships.

Their LLM-based approach offered more nuanced and meaningful groupings than traditional methods. The clusters come with intuitive labels right out of the box, making it easier to navigate complex topics.

The tool transforms the research exploration experience by enabling visual discovery of connections between different areas of AI safety research.

‍

Mapping Intent: Visualizing AI Policy Adherence with Knowledge Graphs

By Mia Hopman, Jack Wittmayer

This team developed a novel method for visualizing and analyzing how AI systems adhere to ethical policies using knowledge graphs (KGs). They tested their approach by comparing a standard GPT-3.5 model against a "jailbroken" version that had its safety measures bypassed.

The key finding was that their knowledge graph visualization approach could effectively capture and display the differences in policy adherence between the two models. The standard model consistently showed strong policy adherence, with all responses connected to low violation scores.

While this method shows promise for improving AI safety monitoring, the researchers acknowledged important limitations: such as relying too heavily on another AI model to correctly interpret and categorize conversations, and that the knowledge graphs become more complex and harder to interpret as the volume of responses grows

‍

I hope you enjoyed reading the work of these AI safety researchers.

These projects demonstrate the breadth of challenges in AI safety and the innovative approaches being developed to address them. We look forward to seeing how these initial findings develop into fuller research projects and contribute to the broader discourse on AI safety.

Get in touch to let the authors know what you think, and whether you think you might be a good fit to collaborate. And if you want to get involved in our next hackathon and have the chance to join our next Studio cohorts - sign up here!