Nov 2, 2025
A Critique of METR's Time Horizon Forecasting
Jake Ushida, Mohammad Rehman
We examine the methodology of METR’s Measuring AI Ability to Complete Long Tasks, a paper that measures the time horizon of AI systems on tasks of varying length in order to forecast future AI automation. We focus on the three benchmarks used in the paper to evaluate human and AI performance: HCAST, SWAA, and RE-Bench. For each benchmark, we identify limitations that affect the paper’s predictions: SWAA is limited to quick knowledge recall, HCAST remains confined to text-based settings, and RE-Bench collapses complex outcomes into broad categories. We believe these limitations must be addressed to better forecast future AI automation.
* In general, I love works that rigorously review preexisting literature as you have done here; we need more academic auditing!
* While this is a well-reasoned and structured critique of METR's paper, it does not present any solutions to the underlying problem or ways we could account for these shortcomings in reported results.
* I would strongly encourage a blog post sharing this work, as it is important to call out these flaws when we find them.
Cite this work
@misc{ushida2025critique,
  title={(HckPrj) A Critique of METR's Time Horizon Forecasting},
  author={Jake Ushida and Mohammad Rehman},
  date={2025-11-02},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}