Dear Apart Community,
Welcome to our newsletter - Apart News!
At Apart Research there is so much brilliant research, great events, and countless community updates to share. This week we reveal our AI Startup Hackathon winners and have a look at the Apart Lab paper just accepted to ICLR.
AI Entrepreneurship Hackathon
Our AI Entrepreneurship Hackathon had an incredible buzz, and teams riffed with one another on ideas about how AI safety startups can make the world a safer place. Below we go through the Top 4 winning teams:
1) AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation by Jacob Arbeid, Jay Bailey, Sam Lowenstein, Jake Pencharz (all from the UK's AI Safety Institute).
2) Prompt+Question Shield by Seon Gunness.
3) AI Risk Management Assurance Network (AIRMAN) by Aidan Kierans.
4) Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping by Adriano, David Baek, Erik Nordby, Emile Delcourt.
***
[1st] AntiMidas: Building Commercially-Viable Agents for Alignment Dataset Generation
Submission link here.
'AI alignment lacks high-quality, real-world preference data needed to align agentic superintelligent systems. Our technical innovation builds on Pacchiardi et al. (2023)’s breakthrough in detecting AI deception through black-box analysis. We adapt their classification methodology to identify intent misalignment between agent actions and true user intent, enabling real-time correction of agent behavior and generation of valuable alignment data.'
'We commercialise this by building a workflow that incorporates a classifier that runs on live trajectories as a user interacts with an agent in commercial contexts. This creates a virtuous cycle: our alignment expertise produces superior agents, driving commercial adoption, which generates increasingly valuable alignment datasets of trillions of trajectories labelled with human feedback: an invaluable resource for AI alignment.'
By Jacob Arbeid, Jay Bailey, Sam Lowenstein, Jake Pencharz (from the UK's AI Safety Institute).
***
[2nd] Prompt+Question Shield
Submission link here.
Seon implemented a 'protective layer using prompt injections and difficult questions to guard comment sections from AI-driven spam.'
By Seon Gunness
***
[3rd] AI Risk Management Assurance Network (AIRMAN)
'The AI Risk Management Assurance Network (AIRMAN) addresses a critical gap in AI safety: the disconnect between existing AI assurance technologies and standardized safety documentation practices. While the market shows high demand for both quality/conformity tools and observability/monitoring systems, currently used solutions operate in silos, offsetting risks of intellectual property leaks and antitrust action at the expense of risk management robustness and transparency. This fragmentation not only weakens safety practices but also exposes organizations to significant liability risks when operating without clear documentation standards and evidence of reasonable duty of care.'
'Our solution creates an open-source standards framework that enables collaboration and knowledge-sharing between frontier AI safety teams while protecting intellectual property and addressing antitrust concerns. By operating as an OASIS Open Project, we can provide legal protection for industry cooperation on developing integrated standards for risk management and monitoring. The AIRMAN is unique in three ways: First, it creates a neutral, dedicated platform where competitors can collaborate on safety standards. Second, it provides technical integration layers that enable interoperability between different types of assurance tools. Third, it offers practical implementation support through templates, training programs, and mentorship systems.'
'The commercial viability of our solution is evidenced by strong willingness-to-pay across all major stakeholder groups for quality and conformity tools. By reducing duplication of effort in standards development and enabling economies of scale in implementation, we create clear value for participants while advancing the critical goal of AI safety.'
By Aidan Kierans
***
[4th] Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping
Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF) to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often present insufficient refusals due to adversarial attacks causing them to revert to reveal harmful knowledge from pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed to unforeseen risks.
This project introduces model scoping, a novel approach to apply a least privilege mindset to LLM safety and limit interactions to a predefined domain. By narrowing the model’s operational domain, model scoping reduces susceptibility to adversarial prompts and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in unpredictable, evolving environments.
By Adriano, David Baek, Erik Nordby, Emile Delcourt
Never miss a hackathon here, by signing up to our next one and also signing up to our newsletter.
New Paper Accepted to ICLR 2025
Our paper DarkBench: Benchmarking Dark Patterns in Large Language Models has just been accepted to ICLR 2025 Conference! We are super excited to head to Singapore - where our current Research Assistant, Clement, is residing!
DarkBench is a comprehensive benchmark for detecting dark design patterns—manipulative techniques that influence user behavior—in interactions with Large Language Models (LLMs). Our benchmark comprises 660 prompts across six categories: brand bias, user retention, sycophancy, anthropomorphism, harmful generation, and sneaking.
We evaluate models from five leading companies (OpenAI, Anthropic, Meta, Mistral, Google) and find that some LLMs are explicitly designed to favor their developers' products and exhibit untruthful communication, among other manipulative behaviors. Companies developing LLMs should recognize and mitigate the impact of dark design patterns to promote more ethical Al.
Read the submission here!
Opportunities
- Never miss a Hackathon by keeping up to date here!
- If you're a founder starting an AI safety company, consider applying to seldonaccelerator.com.
- Our friends at Goodfire are hiring for multiple roles.
- METR wants to measure "some humans to set the baseline, specifically in software engineering, cybersecurity, and ML tasks (and you should have some experience in these fields). Pay is $100/hour base rate plus bonuses for performance; time commitment is ~16 hours before the end of January. Remote contracting job, very flexible...' Link here.
- Def/Acc Salon: Navigating AI Risk & Possibility with Defensive Acceleration event in London.
Have a great week and let’s keep working towards safe and beneficial AI.