New forecasting benchmark: FutureSim
GPT-5.5 performs best at 25%, but Mythos, Gemini 3.1 Pro, and Opus 4.7 are not included. Judging by their Brier Skill Scores, the models don't seem to be much better than simply assigning equal probabilities to all outcomes.
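For readers unfamiliar with the metric, here is a minimal sketch of a Brier Skill Score measured against a uniform baseline. It assumes the standard definition BSS = 1 - BS_model / BS_reference with the multi-class Brier score; whether FutureSim uses exactly this reference forecast is an assumption, and the function names are hypothetical.

```python
from statistics import mean

def brier_score(probs, outcome_index):
    """Multi-class Brier score for one question: squared error summed over outcomes."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

def brier_skill_score(forecasts, outcomes):
    """BSS = 1 - BS_model / BS_uniform, averaged over questions.

    Assumption: the reference forecast assigns equal probability to every outcome.
    """
    bs_model = mean(brier_score(p, o) for p, o in zip(forecasts, outcomes))
    bs_uniform = mean(
        brier_score([1.0 / len(p)] * len(p), o) for p, o in zip(forecasts, outcomes)
    )
    return 1.0 - bs_model / bs_uniform

# Toy inputs for illustration only, not FutureSim data.
toy_forecasts = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1]]
toy_outcomes = [0, 2]  # index of the outcome that actually happened
print(round(brier_skill_score(toy_forecasts, toy_outcomes), 3))
```

A score of 0 means the forecaster does no better than the uniform guess, 1 means perfect forecasts, and negative values mean worse than uniform.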
Since GPT-4o, frontier average scores on METR-Horizon have been remarkably predictable over time.
A simple linear fit of average score vs. release date gives R² = 0.984.
The relationship between average score and log time horizon is also extremely strong:
- p50 horizon: r = 0.998
- p80 horizon: r = 0.992
Claude Mythos scored 85.21%, slightly above the ~83.3% predicted by the pre-Mythos linear trend.
The implied doubling time for METR time horizons is still about 103 days, the same value we reported on February 12th, 2026.
If current trends continue:
- 90% score: July 7, 2026
  - implied p50 horizon: 27.5 hours
  - implied p80 horizon: 4.8 hours
- 95% score: September 18, 2026
  - implied p50 horizon: 44.9 hours, or 1.9 days
  - implied p80 horizon: 7.8 hours
- 100% score: November 30, 2026
  - implied p50 horizon: 73.4 hours, or 3.1 days
  - implied p80 horizon: 12.8 hours
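The projected horizons can be roughly reproduced from the 103-day doubling time alone. The sketch below assumes pure exponential growth anchored at the July 7, 2026 values (27.5 hours for p50, 4.8 hours for p80). The function names are hypothetical, and the underlying fit coefficients are not given here, so small rounding differences are expected.

```python
from datetime import date

DOUBLING_DAYS = 103  # doubling time quoted in the post

def extrapolate_horizon(anchor_hours, anchor_date, target_date,
                        doubling_days=DOUBLING_DAYS):
    """Grow a time horizon exponentially from an anchor date to a target date."""
    elapsed_days = (target_date - anchor_date).days
    return anchor_hours * 2 ** (elapsed_days / doubling_days)

def points_per_doubling(score_gain, days, doubling_days=DOUBLING_DAYS):
    """Score points gained per horizon doubling, given a linear score trend."""
    return score_gain / (days / doubling_days)

anchor = date(2026, 7, 7)  # projected date of the 90% score
for label, target in [("95%", date(2026, 9, 18)), ("100%", date(2026, 11, 30))]:
    p50 = extrapolate_horizon(27.5, anchor, target)
    p80 = extrapolate_horizon(4.8, anchor, target)
    print(f"{label}: p50 ~ {p50:.1f} h ({p50 / 24:.1f} days), p80 ~ {p80:.1f} h")

# 5 score points between the 90% and 95% dates, which are 73 days apart.
print(f"~{points_per_doubling(5, 73):.1f} score points per doubling")
```

Under these assumptions the results agree with the listed values to within about 0.1 hours, and the linear score trend works out to roughly 7 score points per horizon doubling.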