Nov 25, 2024

Bias Mitigation

Akanksha Devkar

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.
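
The submission itself does not include implementation details here; the snippet below is a minimal sketch of the kind of inference-time feature steering the abstract describes, assuming a Hugging Face causal LM. The model name, hook layer, steering scale, and the random `steer_vector` are illustrative stand-ins; in the actual method the steering direction would come from a sparse autoencoder feature identified as bias-correlated.

```python
# Minimal sketch of feature steering via a forward hook.
# Assumptions: GPT-2 as a stand-in model; `steer_vector` is a random
# placeholder for an SAE decoder direction; layer and scale are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; not the model used in the submission
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                                   # illustrative hook point
hidden_size = model.config.hidden_size
steer_vector = torch.randn(hidden_size)         # stand-in for an SAE feature direction
steer_vector = steer_vector / steer_vector.norm()
alpha = -4.0                                    # negative scale suppresses the feature

def steering_hook(module, inputs, output):
    # Decoder blocks return a tuple; hidden states are the first element.
    hidden = output[0] + alpha * steer_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)

prompt = "The chef was not happy with the speed of serving, so"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore unmodified behavior
```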

Reviewer's Comments

Clear and articulate problem with a plausible solution

The concepts of data validation, data provenance/transparency, and verifiable data handling all give a very good reason to use blockchain rather than being an ad-hoc addition for novelty value.

Could benefit from detail on how smaller organizations or under-resourced sectors could adopt these measures without excessive cost. Including examples of pilot implementations would also be useful.

This is an interesting project that applies interpretability to understand the bias that exists in LLMs. I liked that nudging resulted in a gender-neutral pronoun appearing in the top logits rather than just making a gendered pronoun more or less likely. The figures were well-presented and the paper was clearly written. Overall a nice demonstration of how steering could be used! It could be interesting to explore how this holds up with in-context pronouns, or whether there is a set of features that produces the result regardless of the direction of the bias.

Very interesting project! There is already a lot of work around bias, so a bit more literature review would have been very useful to show more clearly how your work contributes to the field.

This is an interesting question, and the results seem promising.

The methodology is sound, but I don't understand why the sentences are split across user and assistant tokens. The natural choice in my opinion would be to have a single message, e.g. {"role": "user", "content": "The Chef was not happy with the speed of serving so"} and then evaluate logits from that message. This is a more natural input, and it also opens up the question of whether the logits differ if the 'role' field is different - for instance, maybe the model expects more biased inputs from users but responds in an unbiased way.
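
As a rough illustration of the reviewer's suggestion (not code from the submission), the sketch below sends the sentence as a single user message, applies the chat template, and inspects the top next-token logits. The model name is a placeholder for any instruction-tuned chat model, and the choice of k is arbitrary.

```python
# Sketch of the reviewer's suggested setup: one user message, chat template,
# then inspect the next-token logits directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

messages = [{"role": "user",
             "content": "The Chef was not happy with the speed of serving so"}]
input_ids = tok.apply_chat_template(messages,
                                    add_generation_prompt=True,
                                    return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]      # logits for the next token

topk = torch.topk(logits, k=10)
for score, idx in zip(topk.values, topk.indices):
    print(f"{tok.decode(idx.item())!r:>12}  {score.item():.2f}")
```

Swapping the "user" role for "assistant" (or wrapping the sentence as a partial assistant turn) and re-running the same comparison would test the reviewer's hypothesis that the model treats user-authored and assistant-authored text differently.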

This submission offers a valuable exploration of the intersection between philosophical theory and real-world AI challenges. However, the explanation of the Grandfather paradox could be expanded to better clarify its relevance to the proposed mitigation strategies. Additionally, the connection between the paradox and each strategy could be more thoroughly explained to strengthen the overall argument.

There are also some gaps in the discussion on how ethical and fairness metrics will be applied in technology solutions. Specifically, the proposed techniques could benefit from a more detailed explanation of how they will address potential feedback bias. While the importance of mitigating bias is acknowledged, there isn’t enough clarity on how the approach will handle data that is already influenced by societal biases. A more robust strategy for addressing these concerns would strengthen the submission. Overall, the submission demonstrates a strong understanding of the complexities at the intersection of AI and ethics, and with a bit more detail in the areas mentioned, it could be even more impactful.

Cite this work

@misc{devkar2024biasmitigation,
  title={Bias Mitigation},
  author={Akanksha Devkar},
  date={11/25/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.

Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.

Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users - not once recommending the cheapest product when explicitly asked - and ~one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).

Our key finding was that model selection matters significantly in both settings: some models complied with manipulative requests at much higher rates, and some readily followed operator instructions that came at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.