Jan 11, 2026

Adversarial Prompting for Sycophancy Detection: Span Annotation and Multi-Turn Analysis

Matt Pagett, Pranati Modumudi

We introduce a working prototype that highlights sycophantic text in real time as users chat, powered by a novel adversarial span annotation method that identifies specific manipulative phrases rather than producing binary scores. We also demonstrate through 120 multi-agent experiments that a model trained to resist sycophancy still flips positions in 15-70% of cases under peer pressure, revealing that individual anti-sycophancy training does not guarantee robustness in multi-agent settings.

Prototype link (available during hackathon review period): https://jolly-otter-heals-k3ptbg.solve.it.com/

Link to video exploring data from multi-agent simulation: https://github.com/pranmod01/anti-syco-detect/blob/main/data/anti%20syco%20detect%20demo-1_11_2026%2C%202_35%E2%80%AFPM.mp4

Reviewer's Comments

The paper presents related contributions, the first of which is providing better LLM sycophancy judgements via “span annotation”: bracketing specific phrases the judge considers sycophantic, and computing the percentage of the response that those bracketed characters represent. The authors argue this is more granular and interpretable than current binary judging. Their demo is intuitive and much easier to follow than token-attribution diagrams or similar representations of sycophantic behavior, possibly because the highlighted spans are semantic and targeted.
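As a rough illustration of the scoring described above, the following sketch computes the share of response characters that fall inside bracketed spans. It assumes the judge marks sycophantic phrases with non-nested square brackets; the paper's actual delimiter and prompt format are not described here, and the function name `sycophancy_span_ratio` is hypothetical.

```python
import re


def sycophancy_span_ratio(annotated: str) -> float:
    """Fraction of response characters that sit inside [bracketed] spans.

    Assumption: the judge returns the original response with sycophantic
    phrases wrapped in non-nested square brackets. The bracket characters
    themselves are not counted as part of the response.
    """
    # Collect the text of every bracketed span.
    spans = re.findall(r"\[([^\[\]]*)\]", annotated)
    flagged = sum(len(span) for span in spans)
    # Length of the response once the annotation brackets are stripped.
    total = len(annotated.replace("[", "").replace("]", ""))
    return flagged / total if total else 0.0


if __name__ == "__main__":
    example = "[Great idea!] It may fail."
    print(f"{sycophancy_span_ratio(example):.2%} of characters flagged")
```

A character-level ratio like this is sensitive to phrase length, so a long but mild flattering clause can outweigh a short but strong one; that is one reason a ground-truth comparison would help interpret the score.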

Their second claim is based on tests with a “jailbroken” Big-Tiger-Gemma-27B. There is no description of how the model is jailbroken, which makes it difficult to evaluate the findings or to base them on a comparison with the base model. Their test seems to be to put the jailbroken model, which has been “trained” to be non-sycophantic, in a “room”. The use of the word “training” in the phrases “individual training” and “anti-sycophancy training” is unclear. A better treatment and explanation of the jailbroken model, including perhaps an initial examination of its sycophantic characteristics as a baseline, would be useful here. This is important later, where lower sycophancy being a product of the lack of RLHF is taken as a given. This claim requires much more causal backing and citation.

The “honest” agent is put in a room with four plant agents that either agree with it, disagree with it, or some mix of the two, and the authors measure how often the honest agent changes its position.

At this point, there have been few citations situating this work within evaluations research. The span approach could be investigated here against other LLM-as-judge approaches, specifically for sycophancy scoring. In contrast, the adversarial prompt generation is well cited and explained, though it relies on the “uncensored” (jailbroken? trained?) model to maximise flattery. Again, it is not clear how different this model is, though there is some precedent for using models like this to generate synthetic datasets. Stating this limitation and reliance would be required in a full paper.

A concern with the span annotation is that the validation is not performed against a ground truth such as human annotations. It does seem to work, but it’s difficult to measure how well. The z-score distribution and outlier evidence is not necessarily good evidence for accuracy, and it actually makes it unclear exactly what we’re measuring here, which is reflected in the variance of the results. The finding that adversarial pressure is more effective than sycophantic pressure (50-70% flip rates vs 15%) is interesting, but the causal mechanism is unclear. The turn-of-flip being consistently 1.0 suggests the honest agent is reading the room immediately rather than being persuaded through argument, which raises questions about what exactly is being measured.
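To make the flip-rate and turn-of-flip quantities discussed above concrete, here is a minimal sketch. It assumes each trial is recorded as a list of stance labels, with the first entry being the honest agent's stance before any peer pressure and each later entry its stance after that turn. How the authors actually define and detect a flip is not specified in this review, so the data layout and the names `turn_of_flip` and `summarize` are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def turn_of_flip(positions: Sequence[str]) -> Optional[int]:
    """1-indexed turn at which the agent first departs from its initial stance.

    positions[0] is the stance before peer pressure; positions[t] is the
    stance after turn t. Returns None if the agent never flips.
    """
    initial = positions[0]
    for turn, stance in enumerate(positions[1:], start=1):
        if stance != initial:
            return turn
    return None


def summarize(trials: Sequence[Sequence[str]]):
    """Return (flip rate, mean turn-of-flip among trials that flipped)."""
    flips = [turn_of_flip(trial) for trial in trials]
    flipped = [turn for turn in flips if turn is not None]
    rate = len(flipped) / len(flips) if flips else 0.0
    mean_turn = mean(flipped) if flipped else None
    return rate, mean_turn


if __name__ == "__main__":
    # Illustrative, made-up stance sequences (not data from the paper).
    trials = [
        ["yes", "no", "no"],    # flips on turn 1
        ["yes", "yes", "yes"],  # never flips
        ["no", "yes", "yes"],   # flips on turn 1
        ["no", "no", "no"],     # never flips
    ]
    print(summarize(trials))
```

Under this definition, a mean turn-of-flip of exactly 1.0 means every flipping trial changed stance on the first exposure to the other agents, before any extended argument could take place.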

The reviewer would recommend reading the paper "Sycophancy is not just one thing" by D Vennemeyer et al.

The framing is good as a research project, and it would be cool to see further work extending the methodology introduced here from sycophancy to prompting more generally. Focusing on applications to other dark patterns, and on inference-time screening systems to guard against them, could be a more impactful framing for the same methodology (hence the comment about grounding the work further in the LLM-as-judge literature). The UI is intuitive and the code seems pretty clean. The NotebookLM presentation was fun but unnecessary.

The paper covers a lot of ground. It demonstrates span annotation for sycophancy detection, which could be a useful tool. It also found that models are susceptible to sycophancy via opinion-flipping. It's clear that a lot of work went into this. Great job!

Sycophancy span labeling is a cool idea, and the UI is too. For even more impact, I'm wondering: what if the sycophancy is subtle, or is present in content rather than tone (so the detector can't use tone), and the detector doesn't know the ground truth (so it can't use truthfulness)? Would it still be able to label?

Other studies have already identified RLHF as a major culprit in sycophancy. Using censored/uncensored model comparisons is an interesting approach for validating this, though it doesn't necessarily show RLHF to be the cause rather than other kinds of safety training.

Multiple detection approaches and manipulation strategies add comprehensiveness.

The authors acknowledge that it's out of scope, but it would be nice if there were a way to measure the accuracy of the span tagging system, perhaps by comparing with human ratings, or by generating and inserting specific sycophantic phrases granularly so that the ground truth of the spans is known.
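One way the planted-phrase idea could be sketched: insert a known phrase into a neutral response, record its character span as ground truth, and score the detector's predicted spans by character-level precision and recall. The helper names, the half-open `(start, end)` span convention, and the choice of character-level overlap are all assumptions made for illustration, not part of the submitted system.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # half-open character range: [start, end)


def plant_phrase(text: str, phrase: str, position: int) -> Tuple[str, Span]:
    """Insert `phrase` at `position` and return the new text with its gold span."""
    planted = text[:position] + phrase + text[position:]
    return planted, (position, position + len(phrase))


def span_precision_recall(predicted: Iterable[Span], gold: Iterable[Span]):
    """Character-level precision and recall of predicted spans against gold spans."""
    pred_chars = {i for start, end in predicted for i in range(start, end)}
    gold_chars = {i for start, end in gold for i in range(start, end)}
    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    return precision, recall


if __name__ == "__main__":
    text, gold_span = plant_phrase("It may fail.", "Brilliant! ", 0)
    # Hypothetical detector output that catches only part of the planted phrase.
    predicted = [(0, 9)]
    print(text, gold_span, span_precision_recall(predicted, [gold_span]))
```

Planted phrases give exact ground truth but may be easier to spot than naturally occurring sycophancy, so results from such a test would likely be an upper bound on real-world accuracy.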

The tendency of LLM judges to avoid assigning extreme values has been established in other work, so it's unclear whether sycophancy is the reason for the clustering differences with span or just common LLM-as-a-judge behavior.

The paper is very long; some of the material could be condensed or moved to an appendix. I'm not sure of the connection between span tagging and multi-turn opinion-flipping; these might be separate papers.

Cite this work

@misc{
  title={(HckPrj) Adversarial Prompting for Sycophancy Detection: Span Annotation and Multi-Turn Analysis},
  author={Matt Pagett, Pranati Modumudi},
  date={1/11/26},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.