Qwen
@Alibaba_Qwen
Open foundation models for AGI.
Joined February 2024
6 Following    209.4K Followers
🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

⚡ 2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.

💡 Key insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community! 🫶🫶

Learn more:
📖 Blog:
💻 Code:
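
For readers unfamiliar with splitting a linear-attention sequence across workers, here is a minimal pure-Python sketch of the general idea. It is not FlashQLA's kernel and not the full GDN update: it assumes a simplified gated linear attention with a scalar gate, S_t = alpha_t * S_{t-1} + v_t k_t^T and o_t = S_t q_t. The function names (`scan`, `split_scan`) are hypothetical. Each segment is scanned independently from a zero state, then a cheap pass carries the state across segments and corrects the outputs. The correction is scaled by the running product of the gates, so it shrinks as the gates decay; the post does not spell out how FlashQLA itself uses the gate.

```python
import random

def outer(v, k):
    # Outer product v k^T as a list of rows.
    return [[vi * kj for kj in k] for vi in v]

def matvec(S, x):
    # Matrix-vector product S x.
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]

def scan(q, k, v, alpha, S0):
    # Sequential reference: S_t = alpha_t * S_{t-1} + v_t k_t^T, o_t = S_t q_t.
    S = [row[:] for row in S0]
    outs = []
    for qt, kt, vt, at in zip(q, k, v, alpha):
        upd = outer(vt, kt)
        S = [[at * s + u for s, u in zip(rs, ru)] for rs, ru in zip(S, upd)]
        outs.append(matvec(S, qt))
    return outs, S

def split_scan(q, k, v, alpha, n_seg, d_v, d_k):
    # Same result as scan() from a zero state, but the sequence is cut into segments.
    T = len(q)
    zero = [[0.0] * d_k for _ in range(d_v)]
    bounds = [(i * T // n_seg, (i + 1) * T // n_seg) for i in range(n_seg)]
    # Phase 1: every segment starts from a zero state, so these scans are
    # independent of each other and could run in parallel.
    local = [scan(q[a:b], k[a:b], v[a:b], alpha[a:b], zero) for a, b in bounds]
    # Phase 2: carry the true incoming state across segments and fix the outputs.
    S_in = zero
    outs = []
    for (a, b), (o_loc, S_loc) in zip(bounds, local):
        decay = 1.0  # running product of the gates inside this segment
        for t in range(a, b):
            decay *= alpha[t]
            carried = matvec(S_in, q[t])
            outs.append([ol + decay * c for ol, c in zip(o_loc[t - a], carried)])
        S_in = [[decay * s + l for s, l in zip(rs, rl)] for rs, rl in zip(S_in, S_loc)]
    return outs

if __name__ == "__main__":
    random.seed(0)
    T, d_k, d_v = 16, 4, 3
    q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
    k = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
    v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
    alpha = [random.uniform(0.5, 1.0) for _ in range(T)]
    zero = [[0.0] * d_k for _ in range(d_v)]
    ref, _ = scan(q, k, v, alpha, zero)
    got = split_scan(q, k, v, alpha, 4, d_v, d_k)
    err = max(abs(x - y) for r, g in zip(ref, got) for x, y in zip(r, g))
    print("max abs difference:", err)
```

The split version matches the sequential scan up to floating-point rounding. Phase 1 is where extra parallelism comes from when a single long sequence would otherwise leave compute units idle.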