Dear Apart Community,
Welcome to our newsletter - Apart News!
At Apart Research there is always so much to share: brilliant research, great events, and countless community updates.
In this week's Apart News, our Research Manager and Research Assistant, Natalia Pérez-Campanero Antolín and Jacob Haimes (watch his recent Researcher Spotlight or read the transcript), are in Canada at NeurIPS 2024.
On the First Day of Christmas...
...we shared our 'CryptoFormalEval: Integrating LLMs and Formal Verification for Automated Cryptographic Protocol Vulnerability Detection' paper at NeurIPS 2024!
One of our workshop papers is CryptoFormalEval. This new benchmark offers a systematic way to evaluate how well Large Language Models can identify vulnerabilities in cryptographic protocols.
Here is the full write-up of the CryptoFormalEval paper on our blog. The arXiv paper is here.
Paper authored by Cristian Curaba, Denis D’Ambrosi, Alessandro Minisini, & Natalia Pérez-Campanero Antolín.
'Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models'
What if an AI hides dangerous capabilities during testing? That's 'sandbagging' - when an AI deliberately underperforms to avoid detection. Our paper develops a technique to detect this behavior!
When it comes to sandbagging, it's not just about an AI doing it of its own accord; it's also about an AI being told or trained to underperform by the model developer. Think of the VW diesel scandal. Paper here.
Authors include Philipp Alexander Kreer, Cameron Tice, Fedor Ryzhenkov, Felix Hofstätter, Jacob Haimes, Prithviraj Singh Shahani and Teun van der Weij.
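To give a flavour of the idea behind the paper, here is a minimal, hypothetical sketch (not the authors' code): add small amounts of random noise to a model's weights and re-run an evaluation. A genuinely weak model tends only to degrade, while a sandbagging model's hidden capability can resurface. The model, evaluation function, and noise scales below are illustrative assumptions.

```python
# Illustrative sketch of noise-injection sandbagging detection.
# NOT the paper's implementation: the model, eval function, and noise scales are placeholders.
import copy
import torch


def add_weight_noise(model, sigma):
    """Return a copy of `model` with Gaussian noise (std `sigma`) added to every parameter."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for param in noisy.parameters():
            param.add_(torch.randn_like(param) * sigma)
    return noisy


def noise_injection_sweep(model, evaluate, sigmas=(0.0, 1e-4, 3e-4, 1e-3), seeds=3):
    """Measure benchmark accuracy under increasing weight noise.

    `evaluate(model) -> float` is any accuracy-style metric (placeholder here).
    A sandbagging model often shows accuracy *increasing* at small noise levels,
    whereas a genuinely weak model only gets worse.
    """
    results = {}
    for sigma in sigmas:
        scores = []
        for seed in range(seeds):
            torch.manual_seed(seed)
            scores.append(evaluate(add_weight_noise(model, sigma)))
        results[sigma] = sum(scores) / len(scores)
    return results


# Hypothetical usage: flag the model if noisy accuracy clearly beats the noiseless baseline.
# results = noise_injection_sweep(my_model, my_eval_fn)
# suspicious = max(results.values()) > results[0.0] + 0.05
```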
'Deceptive agents can collude to hide dangerous features in Sparse Auto Encoders'
This paper explores how AI agents can collude to hide harmful features in Sparse Autoencoders - using deceptive labeling and covert communication.
This one was written by Simon Lermen, Mateusz Dziemian, and our Research Manager Natalia Pérez-Campanero Antolín.
'Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique'
This paper critiques existing approaches to evaluating AI models' cybersecurity capabilities, finding limitations that undermine their effectiveness at capturing real security risks in AI-generated code.
Apart Blog write-up here. arXiv paper here.
Authored by Suhas Hariharan, Zainab Ali Majid, Jaime Raldua and Jacob Haimes.
Goodfire Hackathon Winners
As AI models become more powerful, understanding their internal mechanisms is crucial for building reliable, controllable AI systems.
Our latest Hackathon with Goodfire AI saw some really innovative and promising pilot submissions aimed at decoding black-box AI behavior - to help us understand and steer models' internal workings.
01. AutoSteer: Weight-Preserving Reinforcement Learning for Interpretable Model Control: this submission combined reinforcement learning with neural features, improving model behavior 3x while keeping its core knowledge intact.
02. Classification on Latent Feature Activation for Detecting Adversarial Prompt Vulnerabilities: this team built an AI 'immune system' that catches malicious prompts with 100% accuracy, while distinguishing harmful from harmless content with 83% precision (see the sketch below).
03. Utilitarian Decision-Making in Models - Evaluation and Steering: this project mapped how AI systems make moral decisions, revealing fascinating insights about how machines process ethical choices differently from humans.
04. Steering Swiftly to Safety w/ Sparse Autoencoders: this team developed a method to selectively remove unwanted knowledge from AI systems while preserving their general capabilities.
Read our thread here on all the projects!
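As a rough illustration of how the second project's approach could look (this is not the team's code, and `get_sae_feature_activations` is a dummy stand-in for whatever interpretability API they used), one could read sparse-autoencoder feature activations for each prompt and train a simple linear probe to flag adversarial ones:

```python
# Illustrative sketch: classify prompts as adversarial vs. benign from SAE feature activations.
# NOT the winning team's code; the feature extractor below is a dummy stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split


def get_sae_feature_activations(prompt: str, n_features: int = 512) -> np.ndarray:
    """Dummy stand-in: in practice this would query a sparse-autoencoder /
    interpretability API for the prompt's latent feature activations."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random(n_features)


def train_prompt_classifier(prompts, labels):
    """Train a linear probe on SAE activations; labels are 1 for adversarial, 0 for benign."""
    X = np.stack([get_sae_feature_activations(p) for p in prompts])
    y = np.asarray(labels)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out precision:", precision_score(y_test, clf.predict(X_test), zero_division=0))
    return clf
```

The design choice being illustrated is simply that latent features can serve as inputs to an ordinary classifier; the actual project's feature source, model, and reported numbers come from its own write-up.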
Opportunities
- Never miss a Hackathon by keeping up to date here!
- Our friends at Cooperative AI have a call for research proposals. Deadline is 18th January 2025.
- Submit a paper to Tokyo AI Safety Conference 2025 (TAIS) - submission deadline is 1st January 2025.
Have a great week and let’s keep working towards safe and beneficial AI.