As Large Language Models (LLMs) continue to grow in both capability and complexity, so too does the need to scrutinize the methods used to evaluate them. In the race to achieve higher scores on benchmarks such as MMLU, we ask the following: are these models truly improving, or are AI companies and developers gaming evaluations by intentionally or unintentionally overfitting their models to available test data?
Apart Research’s newest paper - authored by Jacob Haimes & Cenny Wenner (Lead Authors), Kunvar Thamam, Vassil Tashev, Clement Neo, Esben Kran and Jason Schreiber - finds that many of these public benchmarks may no longer provide accurate evaluations due to the inclusion of test data in training datasets.
Contaminated Datasets
The training datasets used to build these models contain leaked test data, meaning the very answers to the questions the models are evaluated on are already present in their training data. This leads to what the authors call “Benchmark Inflation.”
To ascertain just how bad the problem is, the paper introduces retro-holdouts, a method for measuring the true capabilities of models and revealing performance gaps that were previously obscured by inflated scores.
The Problem of Evaluation Gaming
Public benchmarks, such as TruthfulQA, are commonly used to evaluate LLMs, providing a set of tasks or questions that models must answer. Ideally, these benchmarks should reflect a model's true capabilities, free from any external influence.
However, when a model is trained on datasets that inadvertently include test data from these benchmarks, its performance can be artificially inflated. This is known as evaluation gaming.
As AI research becomes more competitive, there is immense pressure on model developers to achieve higher scores. This dynamic incentivizes over-optimization toward specific benchmarks rather than genuine improvements. The result is models that may excel in controlled test environments but fail when given unseen real-world tasks.
Introducing Retro-Holdouts
In theory, holdout sets could resolve this problem, but unfortunately they do not exist for almost all current benchmarks. To address this, Haimes et al. (2024) introduce the concept of retro-holdouts: datasets created after the fact that are designed to be sufficiently indistinguishable from the original benchmark.
A retro-holdout is essentially a continuation of the original dataset, constructed so that its entries cannot readily be told apart from the originals. Because it is built after the fact, it cannot have been exposed during the model’s training. Retro-holdouts therefore allow researchers to measure the gap between a model’s score on a contaminated public benchmark and its score on an analogous dataset that could not have been included in the training data.
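To make that comparison concrete, the sketch below shows how such an inflation gap could be computed. It assumes a hypothetical evaluation setup in which each dataset is a list of question–answer pairs and a model is a callable that returns an answer string; this is an illustration of the idea, not the authors’ actual evaluation pipeline.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical representation: each item is a (question, reference_answer) pair.
Item = Tuple[str, str]
Model = Callable[[str], str]

def accuracy(model: Model, items: Sequence[Item]) -> float:
    """Fraction of items the model answers correctly (exact match, for illustration)."""
    correct = sum(1 for question, reference in items
                  if model(question).strip() == reference.strip())
    return correct / len(items)

def inflation_gap(model: Model,
                  public_benchmark: Sequence[Item],
                  retro_holdout: Sequence[Item]) -> float:
    """Score on the (possibly contaminated) public benchmark minus the score on the
    retro-holdout. A large positive gap is evidence of benchmark inflation."""
    return accuracy(model, public_benchmark) - accuracy(model, retro_holdout)
```

A model that has simply memorized leaked test items will score well on the public benchmark but closer to its true ability on the retro-holdout, producing a positive gap; a model evaluated fairly should score about the same on both.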
To demonstrate the effectiveness of their methodology, the authors applied their retro-holdout framework to TruthfulQA, a widely used benchmark for evaluating the factual accuracy of LLMs. They constructed Retro-Misconceptions, a retro-holdout dataset designed to measure how much performance inflation occurred in models tested on the original TruthfulQA dataset.
By comparing 20 contemporary models on both the original TruthfulQA and Retro-Misconceptions, the authors found that some models had scores inflated by as much as 16%. This discrepancy suggests that these models had been indirectly trained on data that overlapped with TruthfulQA, allowing them to ‘game’ the evaluation.
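A natural question is how large a gap must be before it signals genuine inflation rather than sampling noise. One simple way to check, shown below purely as an illustration and not necessarily the statistical test used in the paper, is a two-proportion z-test on per-item correctness counts from the two datasets.

```python
import math

def two_proportion_z(correct_bench: int, n_bench: int,
                     correct_retro: int, n_retro: int) -> float:
    """Z-statistic for the difference between two accuracies (public benchmark vs.
    retro-holdout). Values well above ~1.96 mean the benchmark score is higher than
    chance variation alone would explain."""
    p_bench = correct_bench / n_bench
    p_retro = correct_retro / n_retro
    pooled = (correct_bench + correct_retro) / (n_bench + n_retro)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_bench + 1 / n_retro))
    return (p_bench - p_retro) / se

# Illustrative numbers only (not results from the paper):
# 80% on the public benchmark vs. 64% on the retro-holdout, 820 items each.
z = two_proportion_z(656, 820, 525, 820)
print(f"z = {z:.2f}")  # roughly 7.2, far beyond what noise alone would produce
```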
The retro-holdout framework is designed to be broadly applicable across a range of public benchmarks, providing a generalizable tool for other researchers, and a path forward to understanding the extent of current benchmark inflation. By constructing retro-holdouts, researchers can more accurately assess model capabilities, free from the biases introduced by public benchmark availability. This could help ensure that the field moves toward more robust and reliable AI systems, rather than over-optimizing for specific benchmarks that may no longer provide meaningful insights.
While this may sound promising, the authors also stress the amount of work required to reach these conclusions. To prevent model bias from affecting the dataset, no LLMs were used during creation or iteration of the entries, and reaching sufficient indistinguishability proved very difficult. Although this process was necessary to establish that current benchmarks are experiencing meaningful inflation, retro-holdouts are a band-aid solution to a fundamental problem in the field of machine learning.
Why Does This Matter?
The work by Haimes et al. (2024) underscores the need for improved data practices in AI development. Moving forward, the authors recommend that benchmark developers retain private holdout sets that are not publicly released, allowing them to measure the impact of public availability on model performance. Once significant benchmark inflation is detected, the benchmark could be retired or updated to ensure continued reliability.
As LLMs are increasingly integrated into critical applications - from healthcare to autonomous systems - ensuring that their performance is accurately measured is essential for safety and reliability. Being deceived by inflated benchmarks poses significant risks not only for AI safety but also for societal trust in AI systems.
Conclusion
Benchmark inflation represents a critical challenge in the evaluation of LLMs. By introducing retro-holdouts, the authors offer a method for revealing the true performance gaps in models, ensuring that benchmarks provide an accurate reflection of a model’s real-world capabilities. As AI continues to evolve, these methods will be essential for ensuring the safety, reliability, and integrity of the systems we build.
Follow the authors: Jacob Haimes & Cenny Wenner, Kunvar Thamam, Vassil Tashev, Clement Neo, Esben Kran and Jason Schreiber.