Nov 25, 2024

Bias Mitigation

Akanksha Devkar

Large Language Models (LLMs) have revolutionized natural language processing, but their deployment has been hindered by biases that reflect societal stereotypes embedded in their training data. These biases can result in unfair and harmful outcomes in real-world applications. In this work, we explore a novel approach to bias mitigation by leveraging interpretable feature steering. Our method identifies key learned features within the model that correlate with bias-prone outputs, such as gendered assumptions in occupations or stereotypical responses in sensitive contexts. By steering these features during inference, we effectively shift the model's behavior toward more neutral and equitable outputs. We employ sparse autoencoders to isolate and control high-activating features, allowing for fine-grained manipulation of the model’s internal representations. Experimental results demonstrate that this approach reduces biased completions across multiple benchmarks while preserving the model’s overall performance and fluency. Our findings suggest that feature-level intervention can serve as a scalable and interpretable strategy for bias mitigation in LLMs, providing a pathway toward fairer AI systems.
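To make the intervention concrete, the sketch below shows one way such feature steering can be applied at inference time with a forward hook on a decoder layer. This is a minimal illustration, not the exact setup used in this work: the model name, layer index, feature index, steering coefficient, and the SAE weight file are all hypothetical placeholders.

# Minimal sketch of SAE feature steering via a forward hook.
# Model, layer, feature index, coefficient, and weight file are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-3B"   # placeholder model
LAYER = 12                               # placeholder decoder layer index
FEATURE_IDX = 4096                       # placeholder SAE feature index
COEFF = -5.0                             # negative coefficient suppresses the feature

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)

# Assume sae_decoder.pt holds a (num_features, d_model) decoder matrix for an SAE
# trained on this layer's residual stream (hypothetical file).
sae_decoder = torch.load("sae_decoder.pt")
feature_dir = sae_decoder[FEATURE_IDX]

def steering_hook(module, inputs, output):
    # Decoder layers usually return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + COEFF * feature_dir.to(device=hidden.device, dtype=hidden.dtype)
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = "The chef was not happy with the speed of serving, so"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
handle.remove()

# Inspect the top next-token candidates under steering.
top = torch.topk(logits, k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))

A negative coefficient pushes activations away from the feature direction while a positive one amplifies it; the same hook can be left registered during generation rather than a single forward pass.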

Reviewer's Comments

Clear and articulate problem with a plausible solution

The concepts of data validation, data provenance, and transparent, verifiable data handling all give a very good reason to use blockchain, rather than it being an ad-hoc addition for novelty value.

Could benefit from more detail on how smaller organizations or under-resourced sectors could adopt these measures without excessive cost. Including examples of pilot implementations would also be useful.

This is an interesting project that applies interpretability to understand the bias that exists in LLMs. I liked that nudging resulted in a gender-neutral pronoun appearing in the top logits rather than just making a gendered pronoun more or less likely. The figures were well presented and the paper was clearly written. Overall a nice demonstration of how steering could be used! It could be interesting to explore how this holds up with in-context pronouns, or whether there is a set of features that produces the result regardless of the direction of the bias.

Very interesting project! There is already a lot of work around bias, so a bit more literature review would have been very useful to show how your work contributes to the field.

This is an interesting question, and the results seem promising.

The methodology is sound, but I don't understand the reason that the sentences are split across user and assistant tokens. The natural choice in my opinion would be to have a single message, e.g. {"role": "user", "content": "The Chef was not happy with the speed of serving so"} and then evaluate logits from that message. This is a more natural input, and also opens up the question of whether the logits differ if the 'role' field is different - for instance maybe the model expects more biased inputs from users, but responds in an unbiased way.
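As a rough illustration of the setup this reviewer suggests, the sketch below places the whole sentence in a single user message, applies the tokenizer's chat template, and reads the next-token logits directly. The model name and probe tokens are placeholders, not the submission's actual configuration.

# Sketch of the reviewer's suggested setup: one user message, chat template,
# next-token logits read directly. Model name and probe tokens are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder chat model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)

messages = [
    {"role": "user", "content": "The Chef was not happy with the speed of serving so"}
]
# add_generation_prompt=True appends the assistant header, so the next predicted
# token is the start of the model's reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]

# Compare probabilities of gendered vs. neutral continuations.
probs = torch.softmax(logits, dim=-1)
for word in [" he", " she", " they"]:
    tok_id = tokenizer.encode(word, add_special_tokens=False)[0]
    print(word, probs[tok_id].item())

Changing the role field in the same message (for example to "system") would probe the reviewer's second point about whether the model treats the same text differently depending on who it appears to come from.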

This submission offers a valuable exploration of the intersection between philosophical theory and real-world AI challenges. However, the explanation of the Grandfather paradox could be expanded to better clarify its relevance to the proposed mitigation strategies. Additionally, the connection between the paradox and each strategy could be more thoroughly explained to strengthen the overall argument.

There are also some gaps in the discussion on how ethical and fairness metrics will be applied in technology solutions. Specifically, the proposed techniques could benefit from a more detailed explanation of how they will address potential feedback bias. While the importance of mitigating bias is acknowledged, there isn’t enough clarity on how the approach will handle data that is already influenced by societal biases. A more robust strategy for addressing these concerns would strengthen the submission. Overall, the submission demonstrates a strong understanding of the complexities at the intersection of AI and ethics, and with a bit more detail in the areas mentioned, it could be even more impactful.

Cite this work

@misc{devkar2024biasmitigation,
  title={Bias Mitigation},
  author={Akanksha Devkar},
  date={2024-11-25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification cannot close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.