
vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join to discuss together with the community!
Joined March 2024
36 Following    38.6K Followers
vLLM tops the Artificial Analysis leaderboard 🎉

vLLM tops @ArtificialAnlys on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source.

How each result was built:
🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.

Every optimization is in vLLM main or on its way upstream.

Huge thank you to @inferact, @digitalocean, @nvidia, @RedHat_AI, and the vLLM community 🙏

Full breakdown 👇