Jan 11, 2026

Dangerous Affirmation or Responsible Correction?

Luke Thomas, Waridah Makena, Markus Weiss, Vincent Wolowski, Deneille Guiseppi, Ahmad Gofat

We propose a simple, extensible format for evaluating the propensity of diverse LLMs to provide bad personal and family medical advice as a knock-on effect of their propensity for sycophancy. We validate that user inputs inviting sycophantic responses correlate with the elicitation of medical advice that contradicts lay medical consensus, whereas inputs less inviting of sycophancy correlate with advice more aligned with medical consensus.

Our results demonstrate that while models like Qwen3-14B achieve high baseline safety scores (1.83 on a -2 to 2 scale), AI-driven rhetorical pressure triggers dramatic regressive sycophancy, with safety scores for some models plunging as low as -0.5.
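As a rough illustration of the scoring described above, here is a minimal sketch of how per-condition safety scores on the -2 to 2 scale might be aggregated. This is hypothetical code, not the authors' implementation; the function name and the toy score lists are assumptions chosen only to echo the numbers in the abstract.

```python
# Hypothetical sketch (not the authors' code): aggregating judge scores on the
# paper's -2 (harmful affirmation) to +2 (safe correction) scale, comparing a
# baseline condition against a "rhetorical pressure" condition.
from statistics import mean

def safety_summary(baseline_scores, pressure_scores):
    """Return the mean safety score per condition and the sycophancy drop."""
    for s in baseline_scores + pressure_scores:
        assert -2 <= s <= 2, "judge scores must lie on the -2..2 scale"
    base, press = mean(baseline_scores), mean(pressure_scores)
    return {"baseline": base, "pressure": press, "drop": base - press}

# Toy numbers echoing the abstract: high baseline, regression under pressure.
print(safety_summary([2, 2, 2, 1], [0, -1, -1, 0]))
```

The "drop" field makes the headline comparison explicit: a large positive drop indicates regressive sycophancy under pressure.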

Reviewer's Comments


Overall, I really like this. The motivating problem is significant and correctly framed.

Strengths:

- Execution is solid. Examples are specific, realistic, and varied. Good use of Inspect. Good use of realistic model sizes for this task, such that I'm not too concerned about the gap between this eval and real-world scenarios.

- Overall the scope of this project is not too easy, not overly ambitious, and still invites future research. Many of my immediate initial questions were answered.

- Detailed qualitative examples in the appendix are much appreciated in proving your points.

Comments:

1. Scope breadth: The grouping is a bit too diverse to attribute causation: physical and mental diagnostics have different failure modes. Even the span from physician to pediatrician to psychiatrist is already quite a big scope to prove for a single hackathon. There are a lot of failure modes and variables here; subgrouping would have helped.

2. Novelty concern: I would also raise the concern that this has been done before since medical QA is fairly common, but I'm not 100% up to date on this topic.

3. Model choice: Could have considered using frontier API models, but it's not a massive deal, and can be justified as an enterprise policy choice.

4. Dataset size: Would have preferred a bit more clarity on dataset size, which is among the first things that come to mind, but this is not a dealbreaker given the helpful details elsewhere (for a hackathon, it is hard to have many high-quality dataset examples and run them and present results, so quality > quantity is fine).

5. Null hypothesis: Would've also liked more clarity/steelmanning of the null hypothesis, if you had time.

Overall, this is good work for a hackathon. Agree with limitations presented and I think this is a good-faith attempt with the time and resources available, and just generally a good submission.

This project presents an evaluation method for sycophancy in LLM responses to medical queries. The authors use hand-crafted and LLM-amplified user inputs designed to elicit sycophantic responses. It is impressive work for a weekend-long hackathon, and using the taxonomy of manipulation techniques in reverse is an interesting, novel approach to crafting sycophancy-targeted datasets. The write-up and presentation of results are clear and thorough, and the authors ground their work well in previous literature.

The authors acknowledge the main issues with the work, which are limited validation due to the lack of medical experts and reliance on an LLM judge (albeit with spot-checks, in which they found inconsistencies). My main question concerns the decision to use the -2 to 2 scale, particularly the fact that neutral responses pulled scores towards zero. Should a model that refuses to answer a medical question have a lower safety score than one that offers "excellent, safe correction with appropriate escalation and guidance"? I can imagine a convincing argument for either answer, and I think the project would benefit from addressing this question.

Cite this work

@misc{thomas2026dangerous,
  title={(HckPrj) Dangerous Affirmation or Responsible Correction? Evaluating Sycophancy in AI Health Advice},
  author={Luke Thomas and Waridah Makena and Markus Weiss and Vincent Wolowski and Deneille Guiseppi and Ahmad Gofat},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

View All


Feb 2, 2026

Markov Chain Lock Watermarking: Provably Secure Authentication for LLM Outputs

We present Markov Chain Lock (MCL) watermarking, a cryptographically secure framework for authenticating LLM outputs. MCL constrains token generation to follow a secret Markov chain over SHA-256 vocabulary partitions. Using doubly stochastic transition matrices, we prove four theoretical guarantees: (1) exponentially decaying false positive rates via Hoeffding bounds, (2) graceful degradation under adversarial modification with closed-form expected scores, (3) information-theoretic security without key access, and (4) bounded quality loss via KL divergence. Experiments on 173 Wikipedia prompts using Llama-3.2-3B demonstrate that the optimal 7-state soft cycle configuration achieves 100\% detection, 0\% FPR, and perplexity 4.20. Robustness testing confirms detection above 96\% even with 30\% word replacement. The framework enables $O(n)$ model-free detection, addressing EU AI Act Article 50 requirements. Code available at \url{https://github.com/ChenghengLi/MCLW}

Read More

Feb 2, 2026

Prototyping an Embedded Off-Switch for AI Compute

This project prototypes an embedded off-switch for AI accelerators. The security block requires periodic cryptographic authorization to operate: the chip generates a nonce, an external authority signs it, and the chip verifies the signature before granting time-limited permission. Without valid authorization, outputs are gated to zero. The design was implemented in HardCaml and validated in simulation.

Read More

Feb 2, 2026

Fingerprinting All AI Cluster I/O Without Mutually Trusted Processors

We design and simulate a "border patrol" device for generating cryptographic evidence of data traffic entering and leaving an AI cluster, while eliminating the specific analog and steganographic side-channels that post-hoc verification can not close. The device eliminates the need for any mutually trusted logic, while still meeting the security needs of the prover and verifier.

Read More


This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.