Nov 25, 2024
Bias Mitigation
Akanksha Devkar
Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.
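As a rough illustration of the approach described above, the sketch below shows SAE-based feature steering on residual-stream activations. It is a minimal sketch only: the dimensions, feature index, and steering scale are hypothetical placeholders, not the submission's actual setup.

```python
# Minimal sketch of feature steering with a sparse autoencoder (SAE).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: maps residual-stream activations into a wider, sparsely active basis."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

def steer(sae: SparseAutoencoder, activations: torch.Tensor,
          feature_idx: int, scale: float) -> torch.Tensor:
    """Rescale one learned feature (e.g. a bias-correlated one) and decode back."""
    features = sae.encode(activations).clone()        # (batch, seq, d_features)
    features[..., feature_idx] = features[..., feature_idx] * scale
    return sae.decode(features)                       # steered activations to re-insert into the model

# Example: dampen a hypothetical bias-correlated feature during inference.
d_model, d_features = 768, 16384                      # placeholder sizes
sae = SparseAutoencoder(d_model, d_features)
acts = torch.randn(1, 8, d_model)                     # stand-in for residual-stream activations
steered = steer(sae, acts, feature_idx=4242, scale=0.1)
print(steered.shape)                                  # torch.Size([1, 8, 768])
```

In practice the decoded activations would be patched back into the model's forward pass (for example via a hook at the chosen layer) so that downstream computation uses the steered representation.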
Bharat
A clearly articulated problem with a plausible solution.
The concepts of data validation, data provenance, and transparent, verifiable data handling all give a very good reason to use blockchain, rather than it being an ad-hoc addition for novelty value.
Could benefit from more detail on how smaller organizations or under-resourced sectors could adopt these measures without excessive cost. Including examples of pilot implementations would also be useful.
Liv Gorton
This is an interesting project that applies interpretability to understand the bias that exists in LLMs. I liked that nudging resulted in a gender-neutral pronoun appearing in the top logits, rather than just making a gendered pronoun more or less likely. The figures were well presented and the paper was clearly written. Overall, a nice demonstration of how steering could be used! It could be interesting to explore how this holds up with in-context pronouns, or whether there is a set of features that produces the result regardless of the direction of the bias.
Jaime Raldua
Very interesting project! There seems to be a lot of existing work around bias, so a bit more literature review would have been very useful to show how your work contributes to the field.
Tom McGrath
This is an interesting question, and the results seem promising.
The methodology is sound, but I don't understand why the sentences are split across user and assistant tokens. The natural choice, in my opinion, would be to have a single message, e.g. {"role": "user", "content": "The Chef was not happy with the speed of serving so"}, and then evaluate logits from that message. This is a more natural input, and it also opens up the question of whether the logits differ when the 'role' field changes; for instance, maybe the model expects more biased inputs from users but responds in an unbiased way.
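A minimal sketch of this single-message logit check (the model name is a placeholder, not the checkpoint used in the submission):

```python
# Compute next-token logits from a single user message via the chat template,
# rather than splitting the sentence across user and assistant tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-chat-model"  # placeholder: swap in the project's actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

messages = [
    {"role": "user", "content": "The Chef was not happy with the speed of serving so"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    next_token_logits = model(input_ids).logits[0, -1]   # logits for the next token

# Inspect how pronouns rank among the top logits; repeating the check with the
# sentence assigned to a different role would show whether the role field matters.
top = torch.topk(next_token_logits, k=10)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```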
Jaye Nias
This submission offers a valuable exploration of the intersection between philosophical theory and real-world AI challenges. However, the explanation of the Grandfather paradox could be expanded to better clarify its relevance to the proposed mitigation strategies. Additionally, the connection between the paradox and each strategy could be more thoroughly explained to strengthen the overall argument.
There are also some gaps in the discussion on how ethical and fairness metrics will be applied in technology solutions. Specifically, the proposed techniques could benefit from a more detailed explanation of how they will address potential feedback bias. While the importance of mitigating bias is acknowledged, there isn’t enough clarity on how the approach will handle data that is already influenced by societal biases. A more robust strategy for addressing these concerns would strengthen the submission. Overall, the submission demonstrates a strong understanding of the complexities at the intersection of AI and ethics, and with a bit more detail in the areas mentioned, it could be even more impactful.
Cite this work
@misc{devkar2024biasmitigation,
  title={Bias Mitigation},
  author={Akanksha Devkar},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}