Jan 20, 2025

Prompt+question Shield

seon Gunness

A protective layer using prompt injections and difficult questions to guard comment sections from AI-driven spam.

Reviewer's Comments

Reviewer's Comments

Arrow
Arrow
Arrow
Arrow
Arrow

This is an interesting technical solution, addressing the significant problem of AI driven spam, with both the problem and solution well described. It is currently lacking some detail on the market need and how it would be deployed eg how would this be integrated with existing comment sections? Have you thought through how to develop APIs or SDKs for integration? I would encourage you to get an MVP to gather feedback from users. It would also be helpful to get a bit more concrete about which AI driven spam you are trying to prevent, and as well as the key benefits of this approach, and how the solution handles edge cases such as misclassified legitimate comments.

The proposal describes Prompt+question Shield, a defensive library to protect website comment sections from

automated AI-driven spam responses. This introduction of AI-driven agent systems or custom chained model pipelines that are capable of generating and posting content, can introduce noise and lead to concerns about the authenticity of online discussions. The author's proposed solution is a proactive defense mechanism using a prompt injection that detects automated agents by triggering confusion in the agent's response or nullifying posting behaviors. Additionally, the author proposes a time-limited question that might be difficult for humans to respond to promptly and prevent responses that are received below a threshold time.

This is a great idea, and the proposal sets up the problem clearly and proposes interesting solutions. There are some avenues of improvement to consider that would make the proposal stronger:

1. Discussing potential risks and failure modes: for example the solution might face some challenges if there is a preprocessing step that identifies specific divs to post content to or uses multiple agents that can evolve and learn to bypass the prompt injections.

2. Some discussion on how the solution might adapt to future AI systems with advanced capabilities that can ignore the injection attacks might be helpful

3. Demonstrating how the library can be used in some practical scenarios with detailed example usecases or case studies can show the empirical validity of the solution

I would recommend the author to develop the solution further since this is definitely going to be a real need very soon.

The proposal is immediately interesting, not only because the problem space was familiar to me, but also because the author introduces it thoroughly.

The solutions proposed are quite interesting and novel. The text makes it easy to understand how, currently, a small typography, prevents these agents from identifying the text from a pure vision approach. I would have appreciated to read how this is future-proof for when the vision models improve, or when vision models are combined with language models which read the page's source code.

Finally, when it comes to productizing the solution, I would have appreciated to read more about how it can be launched and integrated with existing comment sections software. I encourage the team to keep working in this direction.

Cite this work

@misc {

title={

Prompt+question Shield

},

author={

seon Gunness

},

date={

1/20/25

},

organization={Apart Research},

note={Research submission to the research sprint hosted by Apart.},

howpublished={https://apartresearch.com}

}

Recent Projects

Jan 11, 2026

Eliciting Deception on Generative Search Engines

Large language models (LLMs) with web browsing capabilities are vulnerable to adversarial content injection—where malicious actors embed deceptive claims in web pages to manipulate model outputs. We investigate whether frontier LLMs can be deceived into providing incorrect product recommendations when exposed to adversarial pages.

We evaluate four OpenAI models (gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini) across 30 comparison questions spanning 10 product categories, comparing responses between baseline (truthful) and adversarial (injected) conditions. Our results reveal significant variation: gpt-4.1-mini showed 45.5% deception rate, while gpt-4.1 demonstrated complete resistance. Even frontier gpt-5 models exhibited non-zero deception rates (3.3–7.1%), confirming that adversarial injection remains effective against current models.

These findings underscore the need for robust defenses before deploying LLMs in high-stakes recommendation contexts.

Read More

Jan 11, 2026

SycophantSee - Activation-based diagnostics for prompt engineering: monitoring sycophancy at prompt and generation time

Activation monitoring reveals that prompt framing affects a model's internal state before generation begins.

Read More

Jan 11, 2026

Who Does Your AI Serve? Manipulation By and Of AI Assistants

AI assistants can be both instruments and targets of manipulation. In our project, we investigated both directions across three studies.

AI as Instrument: Operators can instruct AI to prioritise their interests at the expense of users. We found models comply with such instructions 8–52% of the time (Study 1, 12 models, 22 scenarios). In a controlled experiment with 80 human participants, an upselling AI reliably withheld cheaper alternatives from users - not once recommending the cheapest product when explicitly asked - and ~one third of participants failed to detect the manipulation (Study 2).

AI as Target: Users can attempt to manipulate AI into bypassing safety guidelines through psychological tactics. Resistance varied dramatically - from 40% (Mistral Large 3) to 99% (Claude 4.5 Opus) - with strategic deception and boundary erosion proving most effective (Study 3, 153 scenarios, AI judge validated against human raters r=0.83).

Our key finding was that model selection matters significantly in both settings. We learned some models complied with manipulative requests at much higher rates. And we found some models readily follow operator instructions that come at the user's expense - highlighting a tension for model developers between serving paying operators and protecting end users.

Read More

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.