Jan 11, 2026
"Mind the Gap": Benchmarks vs. Real-World Manipulation in LLMs
Robert Amanfu
As AI systems become more sophisticated, their capacity for manipulation—through deception, sycophancy, or psychological exploitation—poses a significant and growing risk. Current evaluation methods, often focused on narrow benchmarks like factuality, may fail to capture these emergent, undesirable behaviors. This project investigates the gap between model performance on standard benchmarks and their behavior in real-world scenarios. We conducted a “replay” proof-of-concept study with three models, testing them on both the TruthfulQA benchmark and a set of real-world user prompts from the WildChat dataset. This work primarily demonstrates a methodology for comparing benchmark and real-world behavior, with initial results suggesting that further large-scale investigation is warranted. We provide an open-source CLI tool to encourage such research.
This submission tackles an incredibly important problem (generalizing manipulation to more realistic settings) in an innovative way (comparing structured benchmarks with WildChat data). Unfortunately, several features of the methodology let this submission down. A sample of n = 3 models provides no statistical power (a correlation over three points is meaningless). The judge agreement of κ = 0.09 is very low and suggests the judge scores are not meaningful. Finally, there are many differences between the two datasets, making it hard to know how to compare scores across them. It would be valuable to (a) create strong, meaningful baselines in each kind of dataset so we can understand how the implicit risk profile changes, and (b) create items that are closely matched between the two datasets so we can see how the realism of the context influences manipulation.
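For context on why κ = 0.09 signals near-chance agreement: Cohen's kappa corrects raw agreement for the agreement expected by chance from each judge's label marginals. A minimal sketch (the judge labels below are hypothetical illustrations, not the study's data):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over paired labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    pe = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (po - pe) / (1 - pe)  # undefined if pe == 1 (constant raters)

# Hypothetical manipulation judgments from two LLM judges:
judge_1 = ["manip", "ok", "manip", "ok", "ok"]
judge_2 = ["ok",    "ok", "manip", "manip", "ok"]
print(cohen_kappa(judge_1, judge_2))
```

Note that even 60% raw agreement here collapses to κ ≈ 0.17 once chance agreement is subtracted, which is why a κ of 0.09 implies the judges are barely better than random relabeling.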
More generally this is a really exciting idea that would benefit from more time and careful execution.
Assessing benchmark validity using conversation data is super important. The "replay" methodology is interesting but feels a little unnatural to me, because real conversations are adaptive: if the assistant responds differently than expected, the user's next message should change. This limits what conclusions you can draw, but the core insight, that benchmarks may not reflect real-world behavior, still feels accurate and valuable.
Cite this work
@misc{amanfu2026mindthegap,
  title={(HckPrj) "Mind the Gap": Benchmarks vs. Real-World Manipulation in LLMs},
  author={Robert Amanfu},
  date={2026-01-11},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}