Lisan al Gaib (@scaling01): New forecasting benchmark: FutureSim. GPT-5.5 performs the best at 25%, but Mythos, Gemini 3.1 Pro and Opus 4.7 are not included. Based on their Brier Skill Scores, the models don't seem to be much better than just assigning equal probabilities to all outcomes.


Lisan al Gaib (@scaling01)

2026.05.16 14:09

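For context on the Brier Skill Score comparison in the post, here is a minimal sketch of how a skill score against an equal-probability baseline is commonly computed. It assumes the standard multi-class Brier score and a uniform reference forecast; FutureSim's exact scoring may differ. The function names and the sample forecasts are hypothetical and made up for illustration, not taken from the benchmark.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Multi-class Brier score: squared error summed over the possible
    outcomes of each question, then averaged over questions. Lower is better."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        # Compare each predicted probability with the one-hot resolved outcome.
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(outcomes)


def brier_skill_score(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Skill relative to a uniform forecast (assumed reference):
    1.0 is a perfect forecast, 0.0 is no better than equal probabilities,
    negative values are worse than equal probabilities."""
    uniform = [[1.0 / len(p)] * len(p) for p in probs]
    reference = brier_score(uniform, outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference


# Hypothetical data: three questions with three possible outcomes each.
forecasts = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
resolved = [0, 1, 0]  # index of the outcome that actually happened

print(f"Brier score: {brier_score(forecasts, resolved):.3f}")
print(f"Brier Skill Score vs. uniform: {brier_skill_score(forecasts, resolved):.3f}")
```

Under this definition, a skill score close to zero is what "not much better than assigning equal probabilities to all outcomes" means: the model's Brier score is nearly the same as the uniform baseline's.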