Benchmarks that test what models have memorized are saturating fast. ARC-AGI-3 is asking a harder question: can AI actually learn something new on the fly?
One direction we've been exploring: multi-agent orchestration. In our study, coordinating four frontier LLMs across multiple turns consistently matched or outperformed the strongest single model, even on tasks none of them could solve alone.
The gap between "best single model" and "best coordination of models" is where a lot of the real progress is hiding.
More on our multi-turn, multi-agent orchestration study:
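The coordination pattern can be sketched in a few lines. This is a minimal illustration under assumptions, not the study's actual setup: the agents here are stand-in functions rather than real LLM calls, and the shared-transcript plus majority-vote aggregation is one plausible orchestration rule, not necessarily the one used.

```python
# Minimal sketch of multi-turn, multi-agent orchestration.
# Hypothetical agents and aggregation rule -- not the study's implementation.
from collections import Counter
from typing import Callable

Agent = Callable[[str, list[str]], str]

def orchestrate(task: str, agents: list[Agent], turns: int) -> str:
    """Run several agents for multiple turns over a shared transcript,
    then take the majority answer from the final turn."""
    transcript: list[str] = []
    answers: list[str] = []
    for _ in range(turns):
        # Each agent sees the task plus all proposals from earlier turns
        answers = [agent(task, transcript) for agent in agents]
        transcript.extend(answers)
    # Simple aggregation: majority vote over the last round
    return Counter(answers).most_common(1)[0][0]

# Toy agents: one only commits once another agent has proposed "42"
def cautious(task: str, transcript: list[str]) -> str:
    return "42" if "42" in transcript else "unsure"

def bold(task: str, transcript: list[str]) -> str:
    return "42"

print(orchestrate("solve the puzzle", [cautious, bold, cautious], turns=2))
# → 42
```

The point of the sketch is the coordination dynamic: in turn one the cautious agents fail alone, but after seeing another agent's proposal in the transcript, the group converges on an answer no single cautious agent produced unaided.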
Announcing ARC-AGI-3
The only unsaturated agentic intelligence benchmark in the world
Humans score 100%, AI <1%
This human-AI gap demonstrates that we do not yet have AGI
Most benchmarks test what models already know; ARC-AGI-3 tests how they learn