Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Our results demonstrate that public benchmark scores do not always accurately assess model properties, and underscore the importance of improved data practices in the field.

"Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts" was written during the Apart Lab Fellowship.
