vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join to discuss together with the community!
Joined March 2024
36 Following    38.6K Followers
🚀 vLLM-Omni v0.20.0 is out, aligned with upstream vLLM v0.20.0 (CUDA 13.0 · PyTorch 2.11 · Transformers 5.x).
⚡ Qwen3-Omni throughput +72% on H20, 32 conc (0.241 → 0.414 req/s) via talker / code2wav multi-replica scaling
🎙️ TTS faster & leaner: VoxCPM2 RTF 0.946 → 0.106 · Fish Speech Fast AR latency -53% · Qwen3-TTS / Voxtral-TTS Code2Wav saves ~3.2 GiB
🎨 Diffusion dynamic step-level batching: +7.8% throughput / -5.8% latency
🆕 New / improved: HunyuanImage-3.0, ERNIE T2I, AudioX, Wan2.2-S2V, LTX-2.3, FastGen Wan 2.1
📱 Wan2.2 on NPU production-ready: MindIE-SD, fused ops, VAE BF16, HSDP/USP, +50–60% perf
🧮 Quant expanded: Qwen Omni W4A16, OmniGen2 FP8, Z-Image FP8, HunyuanImage3 NPU, GLM-Image
🧩 Multi-backend updates across CUDA / ROCm / MUSA / NPU / XPU
Check it out →