
Wentao Guo
@WentaoGuo7
CS PhD student @PrincetonCS, Previously CS MEng + BS @CornellCIS
Joined November 2021
199 Following    1K Followers
🚀SonicMoE🚀 now runs at peak throughput on NVIDIA Blackwell GPUs 😃 54% & 35% higher fwd/bwd TFLOPS than the DeepGEMM baseline and 21% higher fwd TFLOPS than the Triton official example. SonicMoE still maintains its minimum activation memory footprint: the same as a dense model with equal activated parameters and independent of expert granularity. We wrote a blogpost on how we leveraged Blackwell features and the software abstraction on QuACK: Work with @MayankMish98, @XinleC295, @istoica05, @tri_dao