Jan 20, 2025
Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping
Adriano, David Baek, Erik Nordby, Emile Delcourt
Summary
Even after Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF) intended to suppress harmful outputs, fine-tuned Large Language Models (LLMs) often fail to refuse reliably: adversarial attacks can cause them to revert and reveal harmful knowledge acquired during pre-training. Machine unlearning has emerged as an alternative, aiming to remove harmful knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed to unforeseen risks. This project introduces model scoping, a novel approach that applies a least-privilege mindset to LLM safety by limiting interactions to a predefined domain. By narrowing the model's operational domain, model scoping reduces susceptibility to adversarial prompts and unforeseen misuse, offering a more robust framework for safe AI deployment in unpredictable, evolving environments.
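One simple way to picture the scoping idea is an input gate that checks whether a prompt falls inside the allowed domain before the underlying model answers. The sketch below is purely illustrative and is not the project's implementation: the domain description, the `all-MiniLM-L6-v2` encoder, the similarity threshold, and the `llm_generate` callable are all assumptions introduced here for demonstration.

```python
# Illustrative sketch of input-level model scoping (hypothetical; not the
# authors' method). An embedding model scores each prompt against a
# predefined domain description, and out-of-scope prompts are refused
# before they ever reach the LLM.
from sentence_transformers import SentenceTransformer, util

DOMAIN_DESCRIPTION = "Customer support questions about billing and account settings."
SCOPE_THRESHOLD = 0.35  # hypothetical cutoff; would be tuned on held-out data

_encoder = SentenceTransformer("all-MiniLM-L6-v2")
_domain_embedding = _encoder.encode(DOMAIN_DESCRIPTION, convert_to_tensor=True)


def in_scope(prompt: str) -> bool:
    """Return True if the prompt is semantically close to the allowed domain."""
    prompt_embedding = _encoder.encode(prompt, convert_to_tensor=True)
    similarity = util.cos_sim(prompt_embedding, _domain_embedding).item()
    return similarity >= SCOPE_THRESHOLD


def scoped_generate(prompt: str, llm_generate) -> str:
    """Gate the underlying model: only in-scope prompts are answered."""
    if not in_scope(prompt):
        return "This assistant only handles billing and account questions."
    return llm_generate(prompt)
```

An input gate like this is only one possible realization; scoping could equally be enforced by fine-tuning the model itself to refuse anything outside the predefined domain.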
Cite this work:
@misc{scopedllm2025,
    title={Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping},
    author={Adriano and David Baek and Erik Nordby and Emile Delcourt},
    year={2025},
    month={January},
    organization={Apart Research},
    note={Research submission to the research sprint hosted by Apart.},
    howpublished={\url{https://apartresearch.com}}
}