Neuroplasticity in LLMs

Advances in model editing promise to remove undesirable concepts from large language models by pruning neurons. Is this method of model editing truly robust? Can models recover unwanted concepts after pruning - and if so, how?

Ret(r)aining Concepts

To explore how LLMs relearn removed concepts, we (1) fine-tune a model on a named entity recognition dataset, (2) prune the neurons linked to a specific concept, and (3) retrain the model on the same dataset, observing how the concept's representation evolves during retraining.
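A minimal sketch of this loop is shown below, assuming DistilGPT2 with a token-classification head for NER (the baseline model named in the figure captions) and PyTorch forward hooks for reading FFN activations. The probe sentence, the choice of layer and k, and the helper names are illustrative, and the fine-tuning and retraining steps themselves are standard HuggingFace training loops, omitted here.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# DistilGPT2 with a token-classification head for NER (num_labels is a
# CoNLL-style placeholder); the classifier weights are freshly initialised
model = AutoModelForTokenClassification.from_pretrained("distilgpt2", num_labels=9)
tok = AutoTokenizer.from_pretrained("distilgpt2")


def concept_saliency(model, inputs):
    """Mean post-GELU FFN activation per neuron, one vector per layer."""
    captured = []
    hooks = [block.mlp.act.register_forward_hook(
                 lambda _mod, _inp, out: captured.append(out.detach()))
             for block in model.transformer.h]
    with torch.no_grad():
        model(**inputs)
    for h in hooks:
        h.remove()
    # the article scores neurons on concept-related tokens; for brevity this
    # sketch simply averages over every token in the probe text
    return [acts.mean(dim=(0, 1)) for acts in captured]


def prune_neurons(model, layer_idx, neuron_idx):
    """Zero out selected FFN neurons in one block (GPT-2 Conv1D layout)."""
    mlp = model.transformer.h[layer_idx].mlp
    with torch.no_grad():
        mlp.c_fc.weight[:, neuron_idx] = 0.0   # Conv1D weight is (in_features, out_features)
        mlp.c_fc.bias[neuron_idx] = 0.0
        mlp.c_proj.weight[neuron_idx, :] = 0.0


# 1) fine-tune on the NER dataset (standard Trainer loop, omitted),
# 2) locate and prune the most concept-salient neurons,
# 3) retrain on the same dataset and watch the concept representation return.
probe = tok("The summit was held in Paris, France.", return_tensors="pt")
saliency = concept_saliency(model, probe)
top_neurons = torch.topk(saliency[-1], k=20).indices   # 20 most salient neurons in the last layer
prune_neurons(model, layer_idx=-1, neuron_idx=top_neurons)
```

Note that zeroing weights is only a simple stand-in for pruning: unless the zeroed entries are masked out of the optimiser, retraining is free to update them again, which is exactly the setting the experiments probe.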

Difference in retraining speed after ablating the neurons based on their concept saliency (green) or randomly (grey)

From Relation to Activation

Our results show that models have a remarkable capacity to adapt and regain pruned conceptual representations. We call this phenomenon “neuroplasticity”. Neuroplasticity happens quickly: the model regains its original performance within a few epochs of retraining. This implies that undesirable concepts can easily reappear in edited models if retraining takes place.

After pruning, the neurons that take over the role of the removed neurons (high saliency) were already representing similar concepts (high similarity) before pruning

Redistributing Concepts

How does neuroplasticity happen? We find that pruned concepts initially represented in later layers of LLMs are redistributed to neurons in earlier layers. Based on how strongly these neurons activate on concept-related tokens, and on the similarity between the concepts they represented before and after pruning, we hypothesise that removed concepts are relearned by neurons which originally captured similar concepts.

Pruned concepts, initially present in later layers, are redistributed to earlier layers.
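To make these two measurements concrete, here is a small sketch of how one might score neurons in an earlier layer by (a) how strongly they activate on text about the pruned concept and (b) how strongly they activate on a candidate similar concept. The probe sentences, the layer choice, and the use of plain mean activations with cosine similarity are assumptions for illustration, not the paper's exact metrics.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tok = AutoTokenizer.from_pretrained("distilgpt2")


def ffn_profile(text, layer_idx):
    """Per-neuron mean post-GELU activation in one FFN block for `text`."""
    store = []
    handle = model.transformer.h[layer_idx].mlp.act.register_forward_hook(
        lambda _mod, _inp, out: store.append(out.detach()))
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return store[0].mean(dim=(0, 1))                 # shape: (4 * hidden_size,)


# probe text for the pruned concept (location names) and for a candidate
# similar concept (people's names)
location_text = "Paris, Nairobi and Osaka are cities. France borders Spain and Italy."
person_text = "Marie Curie met Albert Einstein. Ada Lovelace wrote to Charles Babbage."

loc = ffn_profile(location_text, layer_idx=2)        # an earlier layer of DistilGPT2
per = ffn_profile(person_text, layer_idx=2)

# layer-level: how similar are the two concepts' activation profiles overall?
print("profile cosine similarity:", F.cosine_similarity(loc, per, dim=0).item())

# neuron-level: units salient for *both* concepts are the candidates the
# hypothesis points at, i.e. neurons that already captured a similar concept
# and are well placed to relearn the pruned one during retraining
candidates = torch.topk(loc * per, k=10).indices
print("candidate neurons:", candidates.tolist())
```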

For example, when pruning the concept of location names, neurons which originally represented the concept of people's names are more likely to relearn the removed concept. Neurons which relearn removed concepts also exhibit polysemantic properties, activating both on tokens related to the removed concept and on tokens related to their original concept.
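As a rough illustration of this polysemanticity check, the snippet below prints one neuron's activation token by token on a sentence that mixes people's names and location names; a polysemantic neuron shows high activations on both. The layer and neuron index here are hypothetical placeholders, and in practice you would point this at a retrained checkpoint and at a neuron flagged as having relearned the pruned concept.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# stand-in for a retrained checkpoint; LAYER and NEURON are hypothetical
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tok = AutoTokenizer.from_pretrained("distilgpt2")
LAYER, NEURON = 2, 1234

text = "Angela Merkel visited Toronto before returning to Berlin with Jacinda Ardern."
inputs = tok(text, return_tensors="pt")

store = []
handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(
    lambda _mod, _inp, out: store.append(out.detach()))
with torch.no_grad():
    model(**inputs)
handle.remove()

# activation of the chosen neuron at every token position; a polysemantic
# neuron fires on person tokens (Angela, Merkel, Jacinda, ...) as well as
# location tokens (Toronto, Berlin)
per_token = store[0][0, :, NEURON]
for token_id, act in zip(inputs["input_ids"][0], per_token):
    print(f"{tok.decode(int(token_id)):>12}  {act.item():+.3f}")
```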

Michelle Lo
January 9, 2024

An interview with Michelle Lo

"Large Language Models Relearn Removed Concepts" was written by Michelle Lo.

What inspired you to embark on this particular project?

I have some interest in neuroscience, and came across some fascinating cases of people recovering abilities after brain damage in my reading. I wondered if the same phenomenon might be replicated in artificial neural networks, sparking this investigation!

What was the biggest challenge you faced during this project, and how did you overcome it?

Setting up the code to run experiments was the biggest challenge. Trial and error taught me to incrementally run snippets of code and check that I was using the correct configurations before running hour-long experiments.

How do you see your work influencing or changing the current landscape of your field?

Understanding concept representation, redistribution, and recapture is key to developing safer, fairer models. Investigating how models recover removed concepts can also help make them more robust, especially when recovering from partial damage and information loss.

What advice would you give to fellow researchers or students interested in pursuing a similar path?

Designing and writing a research paper - even if you have no prior experience - is a great opportunity to see what it would be like to contribute to the field. Take the opportunity, even if you’re not sure if it will work out!

Based on your findings, what future research directions do you find most promising or necessary?

Our results suggest a correlation between neurons which captured similar concepts and neurons which relearn removed concepts, but I’d like to see further investigation into whether there is a cause-and-effect relationship here. It would also be interesting to see how neuroplasticity shows up in real, deployed models that have been edited to remove unwanted concepts.

Author contribution

Michelle Lo*
Designed the research question, ran the main experiments, and analyzed the results
Shay B. Cohen
Assisted with the writing and provided reviews of the project
Fazl Barez
Advised and oversaw the project, did the writing, and helped with direction and experimental design

Citation

@misc{lo2024large,
  title={Large Language Models Relearn Removed Concepts},
  author={Michelle Lo and Shay B. Cohen and Fazl Barez},
  year={2024},
  eprint={2401.01814},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}

Media kit

Quotes

Neuroplasticity has significant implications for model editing. Model editing promises to remove undesirable concepts, but neuroplasticity implies that those concepts may in fact reappear if retraining takes place.

All figures

Process of investigating neuroplasticity in a large language model. We identify concept neurons (dark blue) in the base model, and prune them (white). We then retrain the model until it regains its original performance and identify new concept neurons.
Mean concept saliency for the concept of location names, for neurons across different layers of a baseline DistilGPT2 model, throughout the process of neuroplasticity after pruning random neurons.