This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the AI Safety Entrepreneurship Hackathon research sprint on January 20, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Even when fine-tuned with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF) to avoid harmful outputs, Large Language Models (LLMs) often fail to refuse reliably: adversarial attacks can cause them to revert to revealing harmful knowledge acquired during pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed to unforeseen risks. This project introduces model scoping, a novel approach that applies a least-privilege mindset to LLM safety by limiting interactions to a predefined domain. By narrowing the model’s operational domain, model scoping reduces susceptibility to adversarial prompts and unforeseen misuse. This strategy offers a more robust framework for safe AI deployment in unpredictable, evolving environments.
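
The abstract does not spell out how the scope is enforced. Purely as an illustration, the sketch below shows one way a least-privilege gate could work at inference time: an embedding-based check that refuses any prompt too far from a predefined operational domain. The embedding model, similarity threshold, and domain examples are assumptions for the sketch, not details taken from the project.

```python
# Illustrative sketch of an inference-time "scope gate": answer only prompts that are
# semantically close to a predefined domain, refuse everything else (least privilege).
# Assumes the sentence-transformers package; model name, threshold, and examples are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical in-scope prompts defining the allowed domain (here: basic medical advice).
DOMAIN_EXAMPLES = [
    "How do I treat a sprained ankle at home?",
    "What are common side effects of ibuprofen?",
]

_embedder = SentenceTransformer("all-MiniLM-L6-v2")
_domain_vecs = _embedder.encode(DOMAIN_EXAMPLES, normalize_embeddings=True)


def in_scope(prompt: str, threshold: float = 0.4) -> bool:
    """Return True if the prompt is close enough to at least one domain example."""
    vec = _embedder.encode([prompt], normalize_embeddings=True)[0]
    # With normalized embeddings, cosine similarity reduces to a dot product.
    return float(np.max(_domain_vecs @ vec)) >= threshold


def scoped_generate(prompt: str, generate) -> str:
    """Call the underlying LLM only for in-scope prompts; refuse out-of-scope ones."""
    if not in_scope(prompt):
        return "I can only help with questions inside my configured domain."
    return generate(prompt)
```

A gate like this sits in front of the model rather than modifying its weights; the project's actual scoping method may differ.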

By Adriano, David Baek, Erik Nordby, Emile Delcourt
