Jan 19, 2025

Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers

Aisha Gurung

The AI Safety Society is a centralized platform designed to support alumni of AI safety programs, such as SPAR, MATS, and ARENA, as well as independent researchers in the field. By providing access to resources, mentorship, collaboration opportunities, and shared infrastructure, the Society empowers its members to advance impactful work in AI safety. Through institutional and individual subscription models, the Society ensures accessibility across diverse geographies and demographics while fostering global collaboration. This initiative aims to address current gaps in resource access, collaboration, and mentorship, while also building a vibrant community that accelerates progress in the field of AI safety.

Reviewer's Comments

Natalia Perez-Campanero

Personally, I would love to see this succeed - I appreciate the emphasis on community-driven decision-making and resource sharing. The market research questionnaire should provide a solid foundation for building a platform that users would want to use. However, it is unclear to what extent users would adopt this over existing alumni initiatives. It would be helpful to explore in more detail how the platform will be sustained and maintained in the long term, as this is often an issue with similar initiatives. I would also recommend establishing a system for reviewing and curating the resources and knowledge shared on the platform to maintain its integrity and relevance.

Edward Yee

Various fieldbuilders have already launched programs to solve this problem. They have existing communities and opportunities for these people to join. In fact, a bigger challenge has been finding enough good talent to join them.

Shivam Raval

This proposal aims to start The AI Safety Society, a centralized platform to support alumni of AI safety programs (like SPAR, MATS, and ARENA) and independent researchers. The author highlights post-program challenges that alumni face, such as limited resource access, fragmentation, and lack of mentorship, which lead to decreased talent retention. The AI Safety Society aims to bridge this gap by providing further academic and networking opportunities, supporting the development of soft skills, and encouraging good community membership. This would be done by collaborating with institutions that have active AI safety groups and providing additional resources and mentoring opportunities, fostering a global collaboration network.

This is a great initiative and would be beneficial for the AI safety community on a global scale. I want to point out that, since the initiative is geared towards community building and providing access to resources, a non-profit business model might be better suited for it. Further details on the estimated budget and on potential funding agencies and grants would also have strengthened the proposal.

Finn Metz

I definitely agree with the premise that there are many folks going through these programs who ultimately end up in a limbo afterwards without finding a full-time position. I am not sure, though, if making these folks more comfortable afterwards is the right approach. If you wanted to be cynical, you could frame the platform as "welfare for underperforming AIS researchers". So I wouldn't want these folks just hanging around and socializing after their programs; rather, they should go out and do something as quickly as possible: try hard to find a full-time position, become an independent researcher (maybe the platform could help somewhat here), or better yet, start their own orgs, hire AIS folks, and make the whole ecosystem larger. So if you could frame the platform as an interim solution that really gives them this perspective of "try your hardest to apply to these impactful positions, do research independently and use these resources for getting funding etc., or start your own org through e.g. Catalyze", then I would be on board and see good potential. Just spending money for people to linger around longer doesn't seem helpful.

Cite this work

@misc{
  title={Building Bridges for AI Safety: Proposal for a Collaborative Platform for Alumni and Researchers},
  author={Aisha Gurung},
  date={1/19/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed a 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.
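
As a rough illustration (not the authors' actual evaluation code), the following Python sketch shows how such per-model deception rates could be computed from paired baseline/adversarial judgments; the record structure and the is_deceived label are assumptions made for this example.

from collections import defaultdict

# Hypothetical records: one per (model, question) pair, flagged True when the
# adversarial page flipped the recommendation relative to the truthful baseline.
records = [
    {"model": "gpt-4.1-mini", "question_id": 1, "is_deceived": True},
    {"model": "gpt-4.1", "question_id": 1, "is_deceived": False},
    # ... remaining (model, question) pairs ...
]

def deception_rates(records):
    """Fraction of comparison questions on which each model was deceived."""
    deceived, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["model"]] += 1
        deceived[r["model"]] += int(r["is_deceived"])
    return {model: deceived[model] / total[model] for model in total}

print(deception_rates(records))  # e.g. {"gpt-4.1-mini": 1.0, "gpt-4.1": 0.0}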

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.

Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.

Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users - not once recommending the cheapest product when explicitly asked - and ~one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).
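
For context on the judge-validation step, the sketch below shows one standard way to measure agreement between an AI judge and human raters (Pearson's r, computed here with numpy); the score arrays are illustrative placeholders, not the study's data.

import numpy as np

# Hypothetical paired scores for the same scenarios: one from the AI judge,
# one averaged over human raters (scales assumed for illustration).
ai_judge = np.array([0.9, 0.4, 0.7, 0.2, 0.8, 0.6])
human_raters = np.array([0.85, 0.5, 0.65, 0.3, 0.75, 0.55])

# Pearson correlation; a value around 0.83 would match the agreement reported above.
r = np.corrcoef(ai_judge, human_raters)[0, 1]
print(f"Judge-vs-human agreement: r = {r:.2f}")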

Our key finding was that model selection matters significantly in both settings. We learned that some models complied with manipulative requests at much higher rates than others, and we found that some models readily follow operator instructions that come at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.