Advances in model editing promise to remove undesirable concepts from large language models by pruning neurons. Is this method of model editing truly robust? Can models recover unwanted concepts after pruning - and if so, how?
To explore how LLMs relearn removed concepts, we 1) fine-tune a model on a named entity recognition dataset, 2) prune neurons linked to a specific concept, and 3) retrain the model on the same dataset, observing how the concept's representation evolves during retraining.
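As a rough illustration of this prune-and-retrain loop, here is a minimal PyTorch sketch. It is not the paper's code: the `MLPBlock`, the toy classification data, and the hand-picked neuron indices are all hypothetical stand-ins, whereas the paper fine-tunes a real LLM for NER and selects concept neurons with a saliency-based criterion. Keeping the pruned neurons masked out during retraining is also an assumption of this sketch, made so that any recovery has to come from the remaining neurons.

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Toy stand-in for one transformer feed-forward block."""
    def __init__(self, d_model=64, d_hidden=256, n_labels=5):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_hidden)
        self.fc_out = nn.Linear(d_hidden, n_labels)
        # 1 = neuron kept, 0 = neuron pruned; a buffer, so it is not trained
        self.register_buffer("neuron_mask", torch.ones(d_hidden))

    def forward(self, x):
        h = torch.relu(self.fc_in(x)) * self.neuron_mask
        return self.fc_out(h)

def train(model, x, y, epochs, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

torch.manual_seed(0)
x = torch.randn(512, 64)            # toy "token" features
y = torch.randint(0, 5, (512,))     # toy "entity" labels

model = MLPBlock()

# 1) fine-tune on the (toy) labelled data
print("loss after fine-tuning:", train(model, x, y, epochs=200))

# 2) prune the hidden neurons assumed to encode the target concept
concept_neurons = torch.tensor([3, 17, 42, 100])   # hypothetical indices
model.neuron_mask[concept_neurons] = 0.0

# 3) retrain on the same data; pruned neurons stay masked, so any
#    recovered performance must come from the remaining neurons
print("loss after retraining:", train(model, x, y, epochs=50))
```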
Our results show that models have a remarkable capacity to adapt and regain conceptual representations, a phenomenon we call “neuroplasticity”. Neuroplasticity happens quickly: the model regains its performance within a few epochs of retraining. This implies that undesirable concepts can easily reappear in edited models if retraining takes place.
How does neuroplasticity happen? We find that pruned concepts initially represented in later layers of LLMs are redistributed to neurons in earlier layers. Based on how strongly these neurons activate on concept-related tokens, and on the similarity between the concepts they represented before and after pruning, we hypothesise that removed concepts are relearned by neurons which originally captured similar concepts.
For example, when pruning the concept of location names, neurons which originally represented the concept of people's names are more likely to relearn the removed concept. Neurons which relearn removed concepts also exhibit polysemantic properties, activating both on tokens related to the removed concept and on tokens related to their original concept.
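One way to operationalise this kind of analysis is sketched below. It is a hedged illustration rather than the paper's exact metrics: each neuron's mean activation on tokens tagged with a concept is compared against its mean activation on all other tokens, and a neuron that scores highly for both the removed concept and another concept is flagged as polysemantic. The `activations` and `token_labels` inputs are assumed to come from a forward pass over the NER dataset; the synthetic data at the bottom is only there to make the snippet runnable.

```python
import numpy as np

def concept_activation(activations, token_labels, concept):
    """
    Mean activation of each neuron on tokens of `concept`, minus its mean
    activation on all other tokens (higher = more concept-specific).

    activations:  (n_tokens, n_neurons) array of hidden activations
    token_labels: length-n_tokens list of entity tags, e.g. "LOC", "PER", "O"
    """
    mask = np.array([lab == concept for lab in token_labels])
    return activations[mask].mean(axis=0) - activations[~mask].mean(axis=0)

def polysemantic_neurons(activations, token_labels, concept_a, concept_b, thresh=0.5):
    """Indices of neurons that respond strongly to both concepts."""
    score_a = concept_activation(activations, token_labels, concept_a)
    score_b = concept_activation(activations, token_labels, concept_b)
    return np.where((score_a > thresh) & (score_b > thresh))[0]

# Synthetic demo data: 1000 tokens, 256 neurons (real inputs would be
# hidden activations collected from the fine-tuned model).
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 256))
labels = rng.choice(["LOC", "PER", "O"], size=1000).tolist()
print(polysemantic_neurons(acts, labels, "LOC", "PER"))
```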
What inspired you to embark on this particular project?
I have some interest in neuroscience, and in my reading I came across some fascinating cases of people recovering abilities after brain damage. I wondered if the same phenomenon might be replicated in artificial neural networks, which sparked this investigation!
What was the biggest challenge you faced during this project, and how did you overcome it?
Setting up the code to run experiments was the biggest challenge. Trial and error taught me to incrementally run snippets of code and check that I was using the correct configurations before running hour-long experiments.
How do you see your work influencing or changing the current landscape of your field?
Understanding how concepts are represented, redistributed, and recaptured is key to developing safer, fairer models. Investigating how models recover removed concepts can also help make them more robust, especially at recovering from partial damage and information loss.
What advice would you give to fellow researchers or students interested in pursuing a similar path?
Designing and writing a research paper - even if you have no prior experience - is a great opportunity to see what it would be like to contribute to the field. Take the opportunity, even if you’re not sure if it will work out!
Based on your findings, what future research directions do you find most promising or necessary?
Our results suggest a correlation between neurons which originally captured similar concepts and neurons which relearn removed concepts, but I'd like to see further investigation into whether there is a cause-and-effect relationship here. It would also be interesting to see how neuroplasticity manifests in real, deployed models that have been edited to remove unwanted concepts.
@misc{lo2024large,
  title={Large Language Models Relearn Removed Concepts},
  author={Michelle Lo and Shay B. Cohen and Fazl Barez},
  year={2024},
  eprint={2401.01814},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}