
Haider.

@haider1

together, we build an intelligent future.

Joined November 2021

3.8K Following 66.3K Followers

Haider. @haider1

2026.05.17 22:30

Really cool benchmark for long-horizon test-time adaptation. GPT-5.5 in Codex leads on FutureSim, where agents interact with a chronological replay of real-world news and are tasked with predicting future events on some Polymarket questions. GPT-5.5 even moved ahead of the human market aggregate. Interestingly, Gemini 3.1 and Opus 4.7 are missing.
