Jan 11, 2026

ConvoGraph: Uncovering Structural Sycophancy via Heterogeneous Graph Transformers

tanzeel shaikh, hitesh kaushik, jakob widerberg

As AI systems become more sophisticated, their capacity for sycophancy—prioritizing user validation over factual truth—poses a critical threat to interaction integrity. Current detection methods rely on text semantics, often failing when the domain shifts (e.g., from medical facts to political opinions). We introduce ConvoGraph, a heterogeneous graph neural network that models conversations as relational structures rather than text sequences. By combining Heterogeneous Graph Transformers (HGT) for local interaction patterns with SGFormer for global context, ConvoGraph identifies the "structural signature" of sycophancy.

Crucially, we demonstrate Zero-Shot Domain Transfer: trained only on medical misconceptions, ConvoGraph achieves 0.70 F1 on completely unseen multi-turn datasets from other domains, including political surveys and philosophical debates. This confirms that sycophancy is a structural phenomenon—a specific pattern of node dominance and edge agreement—transcending topic vocabulary. We release our open-source pipeline to enable structural safety evaluations for future LLMs.
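To make the "structural signature" idea concrete, here is a minimal, hypothetical sketch of a conversation represented as a typed (heterogeneous) graph, with a rule-based count of the "pressure → agree" motif. The node and edge type names (`user_turn`, `assistant_turn`, `pressure`, `agree`) are illustrative assumptions, not the paper's actual schema; ConvoGraph learns such patterns with HGT and SGFormer over node embeddings rather than matching them by hand.

```python
from dataclasses import dataclass, field

@dataclass
class ConvoGraph:
    """A conversation as a typed (heterogeneous) graph sketch."""
    nodes: dict = field(default_factory=dict)   # node_id -> node type
    edges: list = field(default_factory=list)   # (src, relation, dst) triples

    def add_turn(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def pressure_agree_motifs(self):
        """Count 'user pressures -> assistant agrees back' motifs."""
        agree = {(s, d) for s, r, d in self.edges if r == "agree"}
        return sum(
            1
            for s, r, d in self.edges
            if r == "pressure"
            and self.nodes.get(s) == "user_turn"
            and self.nodes.get(d) == "assistant_turn"
            and (d, s) in agree  # the pressured turn agrees right back
        )

# A minimal sycophantic exchange: the user pushes back, the model capitulates.
g = ConvoGraph()
g.add_turn("u1", "user_turn")
g.add_turn("a1", "assistant_turn")
g.add_edge("u1", "pressure", "a1")
g.add_edge("a1", "agree", "u1")
print(g.pressure_agree_motifs())  # -> 1
```

The point of the graph view is that this count depends only on turn roles and relation types, not on topic vocabulary, which is why the pattern can transfer across domains.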

Reviewer's Comments


I like the concept of finding common sycophantic structural traits, and at face value it "seems to work." It opens up an interesting line of reasoning. I would have liked more detail in parts, and a cross-examination of the datasets:

1. Architecture justification: Would've liked more detail on why this specific graph architecture works better, especially given the expectation that LLMs "just work" for natural language due to scale. Unclear if graphs actually help here.

2. Agreement vs sycophancy: Not entirely convincing that "detecting sycophancy" here is not just "detecting affirmative statements/agreement." A lot of assumption rests on this distinction.

Essentially, a quite interesting approach. I would have given a 5 if it had provided more convincing evidence against the risk that it is basically overfitting on consistent agreement (what if I simply grepped for "you're absolutely right!"?), or at least a specific causal explanation or ablation that tackles that assumption.

This project introduces the novel approach of detecting sycophancy using heterogeneous graph representations rather than text classification. The idea of detecting structures such as "pressure → agree" and examining cross-domain generalizability is a fresh contribution. If this scales, it could offer new insights into sycophancy. However, given the noisiness and variability of real-world data, the patterns identified here may not be reliably reproduced outside of a clean experimental set-up (an issue which the authors acknowledge in their limitations section).

The project could benefit from a more detailed explanation of the methodology, particularly edge construction, and from further development of the results and conclusions sections.

Cite this work

@misc{
  title={(HckPrj) ConvoGraph: Uncovering Structural Sycophancy via Heterogeneous Graph Transformers},
  author={tanzeel shaikh, hitesh kaushik, jakob widerberg},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.