Token crisis: solved. ✅
We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs.
Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens hits 56% HellaSwag & 33% MMLU — no tricks, no cherry-picks.
> No saturation: more repeats = more gains.
🚨 We also dissected the serious methodological flaws in a concurrent work, “Diffusion Beats Autoregressive in Data-Constrained Settings”. Let’s raise the bar for open review!
🔗 Blog & details:
18 🧵s ahead: