Mar 10, 2025
Inspiring People to Go into RL Interp
Christopher Kinoshita, Siya Singh Deneille Guiseppi
This project is a submission to the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by BlueDot Impact and aims to create a course that explains why more work is needed in Reinforcement Learning (RL) interpretability, especially on the problems of reward hacking and goal misgeneralization. The game is a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant as a fun introduction that gets nontechnical people to care about AI safety.
Tarin
This was really fun to play with! The novel games and well-presented examples made this easy to follow and engaging. The games themselves were also well designed and well integrated.
One area where I think you can take this further is being more deliberate about the level of technical understanding of your intended audience. What level of AI context did your test population come in with? I imagine that the mathematical symbols and variables, terms like "episodes" and "q-learning", and the assumed familiarity with "deep neural networks" in the RL sections could discourage a more general, non-technical audience. How could this product be further adapted to reach a wider public population?
Unfortunately, I was unable to play the game directly as I am unable to open the .exe file, but from the documentation provided, it sounds like a great entry-level intro to AI risk while having fun. Overall, this project was a great example of gamification and interactive learning to boost learner context and comprehension of especially complex (but necessary!) topics. Well done!
Hannah Betts
Clear engagement with a gap in the field. I enjoyed the interaction in the early course content (creating a blackjack agent). The examples were clear and stepped slowly through relevant concepts. Very little technical understanding was required. However, the project motivation focused on a critical gap in mechanistic interpretability for RL; the final resource included very introductory materials, and didn't explore mechanistic interpretability, or detail the process behind chain-of-thought LLMs, which might make the connection more concrete for the student.
The interactive component was very promising, and I look forward to seeing further interactive components, and a built-out course, if this project continues to develop. In particular, it would be exciting to strengthen the content that bridges the gap, to the point where the reader can contribute to the concrete projects listed; the terminology in the spreadsheet is clearly more complex than there was time to explain in a weekend hackathon.
Cite this work
@misc{
  title={Inspiring People to Go into RL Interp},
  author={Christopher Kinoshita, Siya Singh Deneille Guiseppi},
  date={3/10/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}