really cool benchmark for long-horizon test-time adaptation
gpt-5.5 in codex leads on FutureSim, where agents interact with a chronological replay of real-world news and are tasked with predicting future events
on some Polymarket questions, gpt-5.5 even moved ahead of the human market aggregate
interestingly, gemini 3.1 and opus 4.7 are missing from the comparison