Dear Apart Community,
Welcome to our newsletter - Apart News!
At Apart Research there is so much to share: brilliant research, great events, and countless community updates. In this week's Apart News, we look at promising new LAT research and get a Hackathon insider's account from Archana.
LAT Proves Promising
Research and blog by Nora Petrova, Hélios Lyons & Alexandra Abbas.
As Apart Lab Fellows, we investigated how Latent Adversarial Training (LAT), as a safety fine-tuning method, affects the representation of refusal behaviour in language models compared to standard Supervised Safety Fine-Tuning (SSFT) and Embedding Space Adversarial Training (AT). We found that LAT appears to encode refusal behaviour in a more distributed way across multiple SVD components in the model's latent space.
Additionally, refusal vectors computed from the LAT model resulted in more effective refusal ablation attacks across all three models, leading to lower refusal rates than vectors from the other models. Despite this, the LAT model maintained the highest refusal rates, making it the most robust of the three against such attacks. However, LAT's stronger encoding of refusal behaviour could itself be exploited, resulting in more successful refusal ablation attacks.
Read our write-up on Apart's blog here.
Introducing Latent Adversarial Training
Latent Adversarial Training (LAT) is a recently proposed technique that introduces perturbations in a model's hidden layers, providing defences against threats like adversarial attacks and trojans without requiring specific failure examples. We investigated how LAT affects the representation of refusal behaviour by comparing it with supervised safety fine-tuning (SSFT) and embedding space adversarial training (AT) against an attack that compromises the model's ability to refuse harmful requests.
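For readers who like to see the mechanics, here is a minimal sketch of a LAT-style training step, using a toy PyTorch classifier as a stand-in for a language model. The layer choice, perturbation budget, and inner-loop settings are illustrative assumptions, not the exact setup from our experiments.

```python
# Minimal sketch of a latent adversarial training (LAT) step on a toy model.
# The real study fine-tunes language models; this only illustrates the idea of
# perturbing hidden activations adversarially during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    """Stand-in for a transformer: an encoder producing hidden states, then a head."""
    def __init__(self, d_in=32, d_hidden=64, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x, latent_perturbation=None):
        h = self.encoder(x)                      # hidden ("latent") activations
        if latent_perturbation is not None:
            h = h + latent_perturbation          # LAT: perturb the hidden layer
        return self.head(h)

def lat_step(model, opt, x, y, eps=0.1, inner_steps=5, inner_lr=0.05):
    """One LAT update: an inner loop finds a worst-case latent perturbation,
    then the outer step trains the model to behave well under it."""
    with torch.no_grad():
        h_shape = model.encoder(x).shape
    delta = torch.zeros(h_shape, requires_grad=True)

    # Inner maximisation: gradient ascent on the loss w.r.t. the latent perturbation.
    for _ in range(inner_steps):
        loss = F.cross_entropy(model(x, latent_perturbation=delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad
            delta.clamp_(-eps, eps)              # keep the perturbation small

    # Outer minimisation: update parameters under the adversarial latent state.
    opt.zero_grad()
    loss = F.cross_entropy(model(x, latent_perturbation=delta.detach()), y)
    loss.backward()
    opt.step()
    return loss.item()

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))
print(lat_step(model, opt, x, y))
```

The key difference from standard adversarial training is that the perturbation is applied to hidden activations rather than to the input or its embedding, which is what allows LAT to harden the model against failure modes it never sees explicit examples of.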
We compared these approaches by computing a "refusal direction" from contrasting pairs of harmful and harmless instructions from the AdvBench and Alpaca datasets, respectively. Our results show that LAT changes how refusal behaviour is encoded in the latent space, concentrating it more in the first two SVD components which account for a greater proportion of variance compared to the reference models. This altered representation results in a more effective ablation attack vector when generated from the LAT model.
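The refusal direction itself can be sketched as a difference of mean activations, with the ablation attack then projecting that direction out of new activations. The snippet below is a simplified illustration that assumes pre-computed activations and uses random placeholders; the real analysis extracts activations from each model at a chosen layer and token position over the AdvBench and Alpaca prompts.

```python
# Sketch of computing a "refusal direction" and ablating it, assuming cached
# residual-stream activations. The arrays below are random placeholders.
import torch

d_model = 512
acts_harmful = torch.randn(128, d_model)   # placeholder: activations on harmful prompts
acts_harmless = torch.randn(128, d_model)  # placeholder: activations on harmless prompts

# Difference-in-means "refusal direction", normalised to unit length.
refusal_dir = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_direction(h, direction):
    """Remove the component of activations h along `direction` (refusal ablation)."""
    return h - (h @ direction).unsqueeze(-1) * direction

# Applying the ablation removes the refusal component from new activations.
h = torch.randn(4, d_model)
h_ablated = ablate_direction(h, refusal_dir)
print((h_ablated @ refusal_dir).abs().max())  # ~0: the direction has been projected out
```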
Conclusions
Our investigation examined refusal rates post-ablation and explored latent space representations by analysing the linear separability of activation projections for harmful-harmless pairs, the SVD components of activation difference vectors, and the transferability of attack vectors. The results provide new insights into how different fine-tuning approaches impact the representation and robustness of refusal behaviour.
Our findings show that LAT alters how refusal behaviour is represented in the latent space. SVD components, computed from activation difference vectors of harmful-harmless instruction pairs, reveal that LAT concentrates the representation more strongly in the first two components compared to SSFT and AT, which encode refusal in a single SVD component.
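The SVD analysis can be illustrated as follows, assuming a matrix of per-pair activation difference vectors (harmful minus harmless). The data here are random placeholders; the actual comparison is run on activations from the LAT, SSFT, and AT models.

```python
# Sketch of the SVD variance analysis over activation difference vectors.
import torch

n_pairs, d_model = 256, 512
diff_vectors = torch.randn(n_pairs, d_model)   # placeholder difference vectors

# Centre and decompose; the singular values show how many components carry
# the refusal representation (one for SSFT/AT vs. two for LAT in our results).
centred = diff_vectors - diff_vectors.mean(dim=0)
U, S, Vh = torch.linalg.svd(centred, full_matrices=False)
explained = S**2 / (S**2).sum()

print("variance explained by the first two components:", explained[:2].sum().item())
```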
Despite this, LAT demonstrates greater robustness to ablation attacks when the same refusal vector is used across all three models, whether the vector is obtained from the baseline or the LAT model. However, when ablated using self-generated vectors, LAT’s “precise” refusal representation makes it more vulnerable compared to SSFT and AT.
We theorize that LAT’s use of perturbations during training enables the model to explore a broader space, allowing it to learn a more comprehensive representation of refusal behaviour. This representation produces a more effective refusal vector compared to those from other models. Notably, LAT remains the most robust across all models when ablated using its own vector.
However, these results also highlight a critical vulnerability: LAT’s refined representation of refusal makes it particularly susceptible to ablation attacks. Future work could focus on addressing this vulnerability while preserving LAT’s ability to create robust and effective latent representations.
Archana: Hackathon Insider
Last October, Johns Hopkins University hosted its first AI Policy Hackathon in partnership with us at Apart Research, bringing together technologists, policy enthusiasts, and students.
Our Community Program Manager, Archana, put together an insider's account of what participants and organizers had to say about bridging the gap between technology and policy.
"It's been a wonderful journey... spending a fruitful weekend in DC," shares Abe, one of the student organizers, capturing the excitement of bringing together tech enthusiasts and policy minds under one roof. From VR demonstrations to engaging workshops, the event pulsed with innovation.
What happens when you put policy experts and technologists in one room? Magic, according to Said, a participant who witnessed something remarkable: "What really caught me off guard was how naturally everyone stepped into exactly the roles we needed, with almost no formal direction. It felt organic."
This natural alignment wasn't just luck. The hackathon's hybrid format allowed participants from Colombia, the UK, US, and Poland to collaborate seamlessly, bringing diverse perspectives to thorny AI governance challenges. For some teams, like Said's, the hackathon was just the beginning—they're now working with the Dominican Republic's government on AI policy development.
Traditional hackathons conjure images of energy drinks and frantic coding sessions. But this was different. As Shaun, another participant, put it: "I have always believed that AI and technology are ultimately as much social challenges as they are technical ones, and this experience truly brought that belief to life."
Think less about racing to build a prototype, and more about crafting frameworks that could shape how AI integrates into society. Teams grappled with questions that went beyond functionality: How would these policies affect different communities? What are the long-term implications? How do we balance innovation with responsibility?
Nathan, a freshman organizer, shares his journey: "Coordinating with OpenAI and Microsoft for sponsorships... having the opportunity to work with industry experts is an experience that I won't forget." His story shows how organizing such events can open doors to industry connections even before graduation.
This wasn't just another line on a resume—it was a crash course in bringing together industry leaders, academics, and policy makers for a common cause.
The success of JHU's hackathon points to something bigger: the growing need for spaces where technical innovation and policy thinking can collide productively. In a world where AI capabilities are advancing daily, we need more events that bridge this gap.
As Abe shared while reminiscing about the VR demos and workshop discussions: "We pulled up months of efforts and made many friends along the way." And isn't that what it's all about? Building connections, bridging divides, and working together to shape the future of AI governance.
The ripples from this hackathon continue to spread. Teams are implementing their solutions in real-world contexts, and participants are taking their learnings back to their universities and organizations. As the field of AI policy continues to evolve, events like these become increasingly crucial in shaping thoughtful, effective governance frameworks.
AI Safety & Assurance Startup Hackathon
AI Safety requires ambition. We are facing critical technological problems in AGI deployment over the next three years: alignment, multi-agent risk, compute security, and exfiltration, among many others. Each of these problems deserves a competent team scaling science-informed solutions.
Join us on 17th-20th January to kick off an ambitious journey into AI safety and security alongside other aligned and talented individuals from both science and business. Sign up here.
Opportunities
- Never miss a Hackathon by keeping up to date here!
- Applications are now open for the Cooperative AI Summer School, which will take place from 9th to 13th July 2025 in Marlow, near London.
- AI Safety Fund opportunities here to work on pressing problems in AI safety.
Have a great week and let’s keep working towards safe and beneficial AI.