vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join to discuss together with the community!
Joined March 2024
36 Following    38.7K Followers
🚀 vLLM-Omni v0.20.0 is out, aligned with upstream vLLM v0.20.0 (CUDA 13.0 · PyTorch 2.11 · Transformers 5.x).
⚡ Qwen3-Omni throughput +72% on H20, 32 conc (0.241 → 0.414 req/s) via talker / code2wav multi-replica scaling
🎙️ TTS faster & leaner: VoxCPM2 RTF 0.946 → 0.106 · Fish Speech Fast AR latency -53% · Qwen3-TTS / Voxtral-TTS Code2Wav saves ~3.2 GiB
🎨 Diffusion dynamic step-level batching: +7.8% throughput / -5.8% latency
🆕 New / improved: HunyuanImage-3.0, ERNIE T2I, AudioX, Wan2.2-S2V, LTX-2.3, FastGen Wan 2.1
📱 Wan2.2 on NPU production-ready: MindIE-SD, fused ops, VAE BF16, HSDP/USP; +50–60% perf
🧮 Quant expanded: Qwen Omni W4A16, OmniGen2 FP8, Z-Image FP8, HunyuanImage3 NPU, GLM-Image
🧩 Multi-backend updates across CUDA / ROCm / MUSA / NPU / XPU
Check it out →