PULSE made weight sync 100x faster. That turned the trainer itself into the bottleneck.
@erfan_mhi just fixed that too. Grail's GRPO trainer is now 1.8x faster on a single B200: MFU up from 27% to 47%, epoch time nearly halved.
Decentralized post-training is converging on centralized speed.