Jan 20, 2025

Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping

Adriano, David Baek, Erik Nordby, Emile Delcourt


Summary

Even with Reinforcement Learning from Human or AI Feedback (RLHF/RLAIF) to avoid harmful outputs, fine-tuned Large Language Models (LLMs) often fail to refuse adversarial prompts that cause them to reveal harmful knowledge retained from pre-training. Machine unlearning has emerged as an alternative that aims to remove harmful knowledge permanently, but it relies on explicitly anticipating threats, leaving models exposed to unforeseen risks. This project introduces model scoping, a novel approach that applies a least-privilege mindset to LLM safety by limiting interactions to a predefined domain. By narrowing the model's operational domain, model scoping reduces susceptibility to adversarial prompts and unforeseen misuse, offering a more robust framework for safe AI deployment in unpredictable, evolving environments.
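
To make the least-privilege idea concrete, below is a minimal illustrative sketch of scoping enforced at the prompt level: an in-domain score gates whether a request ever reaches the underlying model. The class name, classifier, threshold, and refusal message are hypothetical stand-ins, not the project's actual implementation, which may enforce scoping elsewhere (for example, during fine-tuning or at the representation level).

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "This assistant only answers questions about {domain}."


@dataclass
class ScopedLLM:
    """Wraps a base LLM and refuses prompts outside a predefined domain."""
    domain: str
    in_domain: Callable[[str], float]   # returns an in-domain score in [0, 1]
    generate: Callable[[str], str]      # the underlying (already aligned) LLM
    threshold: float = 0.8

    def respond(self, prompt: str) -> str:
        # Least-privilege gate: only in-scope prompts ever reach the model.
        if self.in_domain(prompt) < self.threshold:
            return REFUSAL.format(domain=self.domain)
        return self.generate(prompt)


# Toy stand-ins for demonstration; a real deployment would use a trained
# domain classifier and an actual LLM endpoint.
scoped = ScopedLLM(
    domain="customer billing questions",
    in_domain=lambda p: 1.0 if "invoice" in p.lower() else 0.0,
    generate=lambda p: f"[model answer to: {p}]",
)

print(scoped.respond("Where can I download my invoice?"))
print(scoped.respond("How do I synthesize a dangerous chemical?"))
```

The point of the sketch is the gating pattern itself: out-of-scope requests, including unforeseen adversarial ones, are refused by construction rather than relying on the model's own refusal behavior.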

Cite this work:

@misc{scopedllm2025,
  title        = {Scoped LLM: Enhancing Adversarial Robustness and Security Through Targeted Model Scoping},
  author       = {Adriano and David Baek and Erik Nordby and Emile Delcourt},
  date         = {2025-01-20},
  organization = {Apart Research},
  note         = {Research submission to the research sprint hosted by Apart.},
  howpublished = {https://apartresearch.com}
}

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.