Grok 4 Heavy is a multi-agent version of Grok 4. Instead of relying on a single model, several agents work on the same task in parallel. Once they have produced their results, they compare findings and agree on a final answer.
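To make the parallel-then-agree idea concrete, here is a minimal sketch in Python. It is not xAI's implementation, which is not described here. It assumes each agent is a plain function from a task string to an answer string, and it uses a simple majority vote as a stand-in for the "compare and agree" step. The names solve_with_agents, agent_a, agent_b, and agent_c are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# Hypothetical agent type: takes a task, returns an answer string.
Agent = Callable[[str], str]


def solve_with_agents(task: str, agents: Sequence[Agent]) -> str:
    """Run every agent on the same task in parallel, then return the most common answer."""
    if not agents:
        raise ValueError("at least one agent is required")
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map keeps the answers in the same order as the agents.
        answers = list(pool.map(lambda agent: agent(task), agents))
    # Assumption: majority vote stands in for the agreement step.
    # On a tie, the answer that appeared first wins.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Toy stand-ins for real model calls (hypothetical).
def agent_a(task: str) -> str:
    return "42"


def agent_b(task: str) -> str:
    return "42"


def agent_c(task: str) -> str:
    return "41"


if __name__ == "__main__":
    print(solve_with_agents("What is 6 * 7?", [agent_a, agent_b, agent_c]))
```

A real system would likely let agents exchange intermediate reasoning rather than only vote on final answers; the vote is just the simplest aggregation rule that fits the description.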
On Humanity's Last Exam (HLE), Grok 4 Heavy with tool use scored 44.4%. On ARC-AGI, Grok 4 was the first model to surpass 10%, reaching 15.9%.
Comparison to GPT-5.
On HLE: Grok 4 Heavy outperforms GPT-5 High, 44.4% vs. 42%. Base Grok 4 edges out base GPT-5, with 25.4% accuracy compared to 25.3%.
On ARC-AGI-2: Grok 4 Heavy outperforms GPT-5 High, 15.9% vs. 9.9%, doubling the prior SOTA in visual/spatial reasoning. On the easier ARC-AGI-1, Grok scores about 66.7% and GPT-5 about 65.7%.
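For a sense of how large these margins are, the short sketch below computes the gap in percentage points and the Grok-to-GPT-5 ratio for each pair of scores quoted above. The numbers are copied from the text, not independently verified, and the helper names gap_points and ratio are hypothetical. The ratio compares Grok with GPT-5 only, so it says nothing about the "prior SOTA" claim.

```python
# Scores copied from the text above, in percent: (Grok, GPT-5).
scores = {
    "HLE, Grok 4 Heavy vs. GPT-5 High": (44.4, 42.0),
    "HLE, base models": (25.4, 25.3),
    "ARC-AGI-2": (15.9, 9.9),
    "ARC-AGI-1 (approximate)": (66.7, 65.7),
}


def gap_points(grok: float, gpt5: float) -> float:
    """Absolute gap in percentage points, rounded to one decimal."""
    return round(grok - gpt5, 1)


def ratio(grok: float, gpt5: float) -> float:
    """Grok score divided by GPT-5 score, rounded to two decimals."""
    return round(grok / gpt5, 2)


if __name__ == "__main__":
    for name, (grok, gpt5) in scores.items():
        print(f"{name}: gap {gap_points(grok, gpt5)} points, ratio {ratio(grok, gpt5)}x")
```

Read this way, the base-model HLE gap (0.1 points) and the ARC-AGI-1 gap (about 1 point) are small, while the ARC-AGI-2 gap is the largest in both absolute and relative terms.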