Jan 20, 2025

Securing AGI Deployment and Mitigating Safety Risks

Roshni Kumari

As artificial general intelligence (AGI) systems near deployment readiness, they pose unprecedented challenges in ensuring safe, secure, and aligned operations. Without robust safety measures, AGI can pose significant risks, including misalignment with human values, malicious misuse, adversarial attacks, and data breaches.

Reviewer's Comments


Natalia Perez-Campanero

The proposal seems very ambitious. It does a good job of summarizing the safety risks of AGI, but is a little unclear on which of these the proposed solution would be targeting and how. To develop further, I would recommend focusing on one solution/use case and developing it further, thinking through implementation. Providing a clear definition of AGI would help in setting a foundation for a more concrete discussion of the safety risks and potential solutions/applications.

Shivam Raval

The paper describes a proposal for SentinelAI, a framework for securing AGI deployment and mitigating safety risks. The proposal identifies five different challenges and poses potential solutions that are based on techniques from existing literature. It highlights an important need that will arise as autonomous systems become more and more capable; however, since the proposal is based on a 2027 AGI prediction, this need might be realized in the near term. It would also be helpful to describe what an AGI system is comprised of, and to provide some justification for how the existing approaches would remain valid against a vastly capable system or how they can be modified to form the different components of SentinelAI. Results showing that a simulated strong and capable advanced system (e.g., o3) can be monitored using other strong or even less capable systems (a usual setup in the oversight and control literature) would strengthen the proposal.

Pablo Sanzo

The paper makes a good attempt at summarizing the different AI safety risks. It would have benefitted from proposing a definition for AGI, since the term is used broadly in the paper.

When it comes to the proposed solution, it is appreciated that several layers are explored so as to reduce the risks posed by AGI. The link to "Multi-Agent Simulation" seems broken (https://github.com/SentinelAI-MultiAgentSim).

I would have appreciated a deeper dive into how frontier AI labs are currently solving these challenges, and how the proposed solution could fit and be deployed within the industry (e.g., what incentives there are to use it).

I encourage the team to continue working in this direction and, as soon as more results are available, to show a comparison between an "out-of-the-box" AI model, and one that has been risk-mitigated by SentinelAI.

Cite this work

@misc{
  title={Securing AGI Deployment and Mitigating Safety Risks},
  author={Roshni Kumari},
  date={1/20/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.


Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.


Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users, not once recommending the cheapest product when explicitly asked, and about one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).

Our key finding was that model selection matters significantly in both settings. We learned some models complied with manipulative requests at much higher rates. And we found some models readily follow operator instructions that come at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.