Qwen
@Alibaba_Qwen
Open foundation models for AGI.
Joined February 2024
6 Following    209.4K Followers
🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

⚡ 2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.

💡 Key insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community! 🫶🫶

Learn more:
📖 Blog:
💻 Code:
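
For readers wondering why a linear-attention recurrence can be split along the sequence (the "CP" in the post), here is a minimal pure-Python sketch. It is not FlashQLA's kernel and does not model the gate-driven heuristics. It assumes that "GDN" refers to Gated DeltaNet and uses the commonly published update S_t = α_t (S_{t-1} − β_t (S_{t-1} k_t) k_tᵀ) + β_t v_t k_tᵀ with output o_t = S_t q_t. All function names (`step`, `run`, `chunk_summary`, `chunked`, `make_tokens`) are hypothetical. The point is that each chunk acts on the incoming state as an affine map S_out = S_in P + B, so chunks can be summarized independently and then stitched together.

```python
import random

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def step(S, k, v, alpha, beta):
    """One assumed gated delta-rule update of the (d_v x d_k) state S."""
    Sk = [sum(row[j] * k[j] for j in range(len(k))) for row in S]
    return [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
             for j in range(len(k))] for i in range(len(v))]

def run(S, tokens):
    """Sequential scan from state S; returns final state and outputs S_t q_t."""
    outs = []
    for q, k, v, alpha, beta in tokens:
        S = step(S, k, v, alpha, beta)
        outs.append([sum(row[j] * q[j] for j in range(len(q))) for row in S])
    return S, outs

def chunk_summary(tokens, d_k, d_v):
    """Summarize a chunk as (P, B) such that S_out = S_in @ P + B."""
    P = [[float(i == j) for j in range(d_k)] for i in range(d_k)]
    for _, k, _, alpha, beta in tokens:
        # Per-token transition M = alpha * (I - beta * k k^T).
        M = [[alpha * (float(i == j) - beta * k[i] * k[j])
              for j in range(d_k)] for i in range(d_k)]
        P = matmul(P, M)
    # B is the state the chunk produces when started from zero.
    B, _ = run([[0.0] * d_k for _ in range(d_v)], tokens)
    return P, B

def chunked(tokens, chunk, d_k, d_v):
    """Sequence-split evaluation: summarize chunks, stitch states, replay."""
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
    # Pass 1: per-chunk summaries are independent of each other.
    summaries = [chunk_summary(c, d_k, d_v) for c in chunks]
    # Pass 2: cheap sequential stitch of chunk-boundary states.
    S = [[0.0] * d_k for _ in range(d_v)]
    starts = []
    for P, B in summaries:
        starts.append(S)
        SP = matmul(S, P)
        S = [[SP[i][j] + B[i][j] for j in range(d_k)] for i in range(d_v)]
    # Pass 3: each chunk replays from its own start state, again independently.
    outs = []
    for s0, c in zip(starts, chunks):
        outs.extend(run(s0, c)[1])
    return S, outs

def make_tokens(n, d_k, d_v, seed=0):
    """Random (q, k, v, alpha, beta) tuples with unit-norm keys."""
    rng = random.Random(seed)
    toks = []
    for _ in range(n):
        q = [rng.uniform(-1, 1) for _ in range(d_k)]
        k = [rng.uniform(-1, 1) for _ in range(d_k)]
        norm = sum(x * x for x in k) ** 0.5
        k = [x / norm for x in k]
        v = [rng.uniform(-1, 1) for _ in range(d_v)]
        toks.append((q, k, v, rng.uniform(0.8, 1.0), rng.uniform(0.0, 1.0)))
    return toks
```

Passes 1 and 3 are the parts that could run concurrently on a real device. Here they are ordinary loops, so the sketch only checks that the chunked result matches the sequential scan, not that it is faster.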
๋” ๋ณด๊ธฐ