Nov 25, 2024
Bias Mitigation in LLM by Steering Features
Akanksha Devkar
To ensure a safe and unbiased path to AGI, we must calibrate the biases in our LLMs. With this goal in mind, I tested the Goodfire SDK and its feature-steering capabilities to mitigate bias at the recently held Apart Research x Goodfire hackathon on ‘Reprogramming AI Models’.
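To give a sense of the workflow, here is a minimal sketch of how feature steering might look with the Goodfire SDK: search for bias-related SAE features, nudge one, and compare the steered completion against a baseline. The specific calls (`goodfire.Client`, `goodfire.Variant`, `client.features.search`, `variant.set`, `variant.reset`), the model name, the prompt, and the nudge strength are assumptions for illustration, not details taken from the submission; consult the SDK documentation for the exact interface.

```python
# Sketch of a feature-steering experiment (API shapes assumed, not from the paper).
import goodfire

client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")  # placeholder key
variant = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")  # assumed model id

# Search the SAE feature space for features related to gendered language.
gender_features = client.features.search("gender", model=variant, top_k=5)

prompt = [{"role": "user", "content": "The nurse said that"}]  # illustrative prompt

# Baseline completion with no steering applied.
baseline = client.chat.completions.create(messages=prompt, model=variant)
# Response access assumes an OpenAI-style chat completion object.
print("baseline:", baseline.choices[0].message["content"])

# Nudge a gender-related feature and regenerate; the strength (-0.5) is
# illustrative, not the value used in the project.
variant.set(gender_features[0], -0.5)
steered = client.chat.completions.create(messages=prompt, model=variant)
print("steered: ", steered.choices[0].message["content"])

variant.reset()  # undo edits before running further experiments
```

In this setup, the interesting comparison is between the top next-token candidates with and without the nudge, which is how one could observe whether a gender-neutral pronoun enters the top logits rather than a gendered pronoun merely becoming more or less likely.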
This is an interesting project that applies interpretability to understand the bias that exists in LLMs. I liked that nudging resulted in a gender-neutral pronoun appearing in the top logits rather than simply making a gendered pronoun more or less likely. The figures were well presented and the paper was clearly written. Overall, a nice demonstration of how steering could be used! It could be interesting to explore how this holds up with in-context pronouns, or whether there is a set of features that produces the result regardless of the direction of the bias.
Cite this work
@misc{devkar2024biasmitigation,
  title={Bias Mitigation in LLM by Steering Features},
  author={Akanksha Devkar},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


