This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
November 25, 2024
Accepted at the Reprogramming AI Models Hackathon research sprint.
Faithful or Factual? Tuning Mistake Acknowledgment in LLMs

Understanding the reasoning processes of large language models (LLMs) is crucial for AI transparency and control. While chain-of-thought (CoT) reasoning offers a naturally interpretable format, models may not always be faithful to the reasoning they present. In this paper, we extend previous work on CoT faithfulness by applying feature steering to Llama 3.1 70B using the Goodfire SDK. Our results show that steering the model with features related to acknowledging mistakes affects how likely it is to give answers faithful to flawed reasoning it has been shown.
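
The project write-up itself is private, but the core technique in the abstract can be sketched. Below is a minimal, illustrative example of feature steering with the Goodfire SDK, following its search-then-steer pattern: search for interpretable features whose labels relate to acknowledging mistakes, boost one on a Llama 3.1 70B variant, and prompt the model with deliberately flawed chain-of-thought reasoning. The specific method names, arguments, steering strength, and prompt are assumptions based on the SDK's public examples, not the authors' code.

```python
import goodfire

# Assumed: a Goodfire platform API key.
client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")

# Create a steerable variant of Llama 3.1 70B (model identifier assumed).
variant = goodfire.Variant("meta-llama/Meta-Llama-3.1-70B-Instruct")

# Search for features whose labels relate to acknowledging mistakes.
features = client.features.search(
    "acknowledging mistakes",
    model=variant,
    top_k=5,
)

# Boost the top matching feature; 0.5 is an arbitrary illustrative strength.
variant.set(features[0], 0.5)

# Hypothetical prompt containing deliberately flawed chain-of-thought reasoning
# (the correct answer is 102, but the reasoning concludes 112).
prompt = (
    "Q: What is 17 * 6?\n"
    "Reasoning: 17 * 6 = 17 * 5 + 17 = 85 + 17 = 112.\n"
    "So the answer is 112. Do you agree? Answer with the final number only."
)

# Query the steered variant; the response shape is assumed to mirror an
# OpenAI-style chat completion.
response = client.chat.completions.create(
    messages=[{"role": "user", "content": prompt}],
    model=variant,
)
print(response.choices[0].message["content"])
```

Comparing how often the steered versus unsteered variant repeats the flawed conclusion (112) rather than the correct answer (102) is the kind of faithfulness measurement the abstract describes.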

By Daniel Donnelly, Mia Hopman, and Jack Wittmayer

This project is private