๐ Introducing FlashQLA: high-performance linear attention kernels built on TileLang.
โก 2โ3ร forward speedup. 2ร backward speedup.
๐ป Purpose-built for agentic AI on your personal devices.
๐กKey insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.
FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.
Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.
The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2ร+ kernel-level speedups.
We hope this is useful to the community!๐ซถ๐ซถ
Learn more:
๐ Blog:
๐ป Code: