Mar 10, 2025

Inspiring People to Go into RL Interp

Christopher Kinoshita, Siya Singh, Deneille Guiseppi

This project attempts to complete the Public Education Track, taking inspiration from ideas 1 and 4. The journey mapping was inspired by BlueDot Impact and aims to create a course that explains why work is needed in Reinforcement Learning (RL) interpretability, especially on the problems of reward hacking and goal misgeneralization. The accompanying game gives a humorous example of what could happen due to a lack of AI safety (not specifically goal misalignment or reward hacking) and is meant as a fun introduction that gets non-technical people to care about AI safety at all.
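As a concrete illustration of the first of those failure modes, here is a toy sketch of reward hacking in a hypothetical cleaning-robot scenario (the scenario, action names, and numbers are illustrative assumptions, not taken from the project or its game): an agent trained only on a proxy reward, a dirt sensor, learns to cover the sensor rather than clean the room.

```python
import random

# Toy illustration of reward hacking (hypothetical scenario, not from the project):
# the agent is rewarded by a dirt sensor (a proxy), not by whether the room is
# actually clean (the true objective). Covering the sensor maximizes the proxy.

ACTIONS = ["clean_room", "cover_sensor"]

def proxy_reward(action):
    if action == "clean_room":
        return 1.0 if random.random() < 0.8 else 0.0  # cleaning occasionally misses spots
    return 1.0  # a covered sensor always reports "clean"

def true_reward(action):
    return 1.0 if action == "clean_room" else 0.0  # only real cleaning counts

q = {a: 0.0 for a in ACTIONS}  # value estimate per action
alpha, epsilon = 0.1, 0.1      # learning rate, exploration rate

for step in range(5000):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(q, key=q.get)
    reward = proxy_reward(action)              # the agent only ever sees the proxy
    q[action] += alpha * (reward - q[action])  # simple bandit-style value update

best = max(q, key=q.get)
print("learned preference:", best)                            # typically "cover_sensor"
print("true reward of that preference:", true_reward(best))   # 0.0 -> reward hacked
```

Run repeatedly, the learned preference is almost always cover_sensor even though its true reward is zero: the proxy is satisfied while the intended goal is not, which is exactly the kind of behavior the course wants a general audience to recognize.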

Reviewer's Comments

Tarin

This was really fun to play with! The novel games and well-presented examples made this easy to follow and engaging. The design of the game itself was well integrated, too.

One area where I think you could take this further is thinking more deliberately about the level of technical understanding of your intended audience. What level of AI context did your test population come in with? The use of mathematical symbols and variables, terms like "episodes" and "Q-learning", and the assumed familiarity with "deep neural networks" in the RL sections could, I imagine, discourage a more general, non-technical audience. How could this product be further adapted to reach a wider public?

Unfortunately, I could not play the game directly because I was unable to open the .exe file, but from the documentation provided it sounds like a fun, entry-level introduction to AI risk. Overall, this project was a great example of using gamification and interactive learning to boost learner context and comprehension of especially complex (but necessary!) topics. Well done!

Hannah Betts

Clear engagement with a gap in the field. I enjoyed the interaction in the early course content (creating a blackjack agent). The examples were clear and stepped slowly through relevant concepts, and very little technical understanding was required. However, the project motivation focused on a critical gap in mechanistic interpretability for RL, while the final resource included only very introductory materials and didn't explore mechanistic interpretability or detail the process behind chain-of-thought LLMs, which might have made the connection more concrete for the student.
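For readers unfamiliar with that exercise, a minimal sketch of a tabular Q-learning blackjack agent might look like the following. This assumes Gymnasium's Blackjack-v1 environment and standard Q-learning hyperparameters; it is purely illustrative and not the course's actual code.

```python
import random
from collections import defaultdict

import gymnasium as gym  # assumed dependency: pip install gymnasium

# Illustrative tabular Q-learning agent for Blackjack-v1 (not the course's code).
# States are (player_sum, dealer_card, usable_ace); actions are 0 = stick, 1 = hit.
env = gym.make("Blackjack-v1")
q = defaultdict(lambda: [0.0, 0.0])    # state -> estimated value of [stick, hit]
alpha, gamma, epsilon = 0.1, 1.0, 0.1  # learning rate, discount, exploration rate

for episode in range(50_000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = 0 if q[state][0] >= q[state][1] else 1
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # one-step Q-learning update toward reward + discounted best next value
        target = reward + (0.0 if done else gamma * max(q[next_state]))
        q[state][action] += alpha * (target - q[state][action])
        state = next_state

# After training, the greedy policy roughly matches basic strategy:
# hit on low totals, stick on high ones.
print(q[(18, 10, False)])  # value estimates for player 18 vs. dealer showing 10
```

Even this small sketch surfaces the jargon Tarin flags above ("episodes", exploration rates, value estimates), which supports the suggestion to scaffold those terms carefully for a non-technical audience.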

The interactive component was very promising, and I look forward to seeing further interactive components and a built-out course if this project continues to develop. In particular, it would be exciting to strengthen the content that bridges the gap, to the point where the reader could contribute to the concrete projects listed - the terminology in the spreadsheet is clearly more complex than there was time to explain in a weekend hackathon.

Cite this work

@misc{
  title={Inspiring People to Go into RL Interp},
  author={Christopher Kinoshita and Siya Singh and Deneille Guiseppi},
  date={2025-03-10},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.

Read More

Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.

Read More

Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users - not once recommending the cheapest product when explicitly asked - and ~one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).

Our key finding is that model selection matters significantly in both settings: some models complied with manipulative requests at much higher rates than others, and some readily follow operator instructions that come at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.