vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join to discuss together with the community!
36 Following    38.6K Followers
This week's vLLM Office Hours: @AMD on trends in AI agent applications. Every contribution ships upstream in vLLM main. The primitives agentic inference needs are all in vLLM today:
🧠 Prefix caching: automatic KV reuse across agent turns, lower TTFT
🦅 EAGLE / P-EAGLE spec decode: draft proposals verified in a single pass
🛠️ Tool calling: parallel calls + guided decoding for schema-compliant outputs
🌙 Mooncake KV connector: distributed KV offload for long agentic traces
💾 CPU KV offload: throughput gains once KV cache outgrows GPU memory
🧭 vLLM Semantic Router: route requests across small vs large models (joint work with @AIatAMD)
Full session 👇
[vLLM Office Hours #49] Latest Trends in AI Agent Applications and vLLM - May 14, 2026
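To make the prefix caching item in the Office Hours post above concrete, here is a minimal toy sketch of the general idea: token blocks are chain-hashed, and a new request reuses every leading block that is already cached. This is not vLLM's implementation. The block size, the `ToyPrefixCache` class, and the `lookup_and_insert` method are hypothetical names chosen for illustration, and no real KV tensors are involved.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; an illustrative value, not a vLLM default


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Hypothetical stand-in for a KV block cache; stores only block hashes."""

    def __init__(self):
        self.blocks = set()

    def lookup_and_insert(self, tokens):
        """Return how many leading tokens could reuse cached KV, then cache this request."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self.blocks:
                break  # the first miss ends the reusable prefix
            hit_blocks += 1
        self.blocks.update(hashes)
        return hit_blocks * BLOCK_SIZE


cache = ToyPrefixCache()
turn1 = list(range(10))            # first agent turn: 2 full blocks plus 2 leftover tokens
turn2 = turn1 + [100, 101, 102]    # second turn extends the same conversation
reused_turn1 = cache.lookup_and_insert(turn1)
reused_turn2 = cache.lookup_and_insert(turn2)
print(reused_turn1, reused_turn2)
```

In this sketch the second turn reuses only the full blocks it shares with the first turn, which is why the leftover tokens at the end of a turn are recomputed. Fewer tokens to prefill is the intuition behind the lower TTFT mentioned in the post.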
Excited to see TOKENSPEED_MLA integrated into vLLM on Blackwell GPUs. Happy to see more DeepSeek-R1 / Kimi-K2.5 users benefit from the software optimizations and acceleration brought by TokenSpeed MLA. Looking forward to more optimizations and collaborations with the open-source community ahead.
🎉 Day-0 vLLM support for Intern-S2-Preview! Congrats to the @intern_lm team — an open-source scientific multimodal foundation model, with a first take on material crystal structure generation alongside general capabilities. 📖
🥳 Introducing Intern-S2-Preview, an efficient 35B scientific multimodal foundation model.
1⃣ Delivers performance comparable to the trillion-scale Intern-S1-Pro on core scientific tasks.
2⃣ The first open-source model with material crystal structure generation capabilities and strong general capabilities.
3⃣ Significantly stronger scientific agent capabilities on multiple benchmarks.
4⃣ Improves MTP acceptance rate and token generation speed via shared-weight MTP + KL loss.
5⃣ CoT compression shortens responses while preserving strong reasoning, improving both performance and efficiency.
🥰 Now supported by vLLM (@vllm_project) and SGLang (@lmsysorg), with more ecosystem integrations on the way.
🤗 Model: @huggingface @ModelScope2022
🤗 GitHub:
🤗 Try it now at:
Great work at @baseten running vLLM-Omni in production — open-source, production-grade, cost-efficient omni-modal serving 🎙️ Multi-stage audio, streaming multi-modal, real-time TTS — workloads where closed-source APIs have been the default. →
We serve Qwen3-TTS on vLLM-Omni at $3 per 1M characters. That's 90% lower in cost than comparable closed-source TTS APIs. Our engineers optimized a single-replica serving stack to get there. Details on the optimized stack and cost per concurrent stream here.
Congrats to @AntLingAGI on Ring-2.6-1T going open! 🎉 The thinking sibling of Ling-2.6-1T — trillion-scale, built for agent execution and complex reasoning. Day-0 vLLM support is ready. 🤗
We use renderers across Lab, verifiers, and prime-rl. We are collaborating with leading open-source partners, including @NVIDIA @vllm_project @sgl_project, to ensure it can become a useful standard across models, inference engines, and RL infra stacks throughout the ecosystem.
vLLM tops the Artificial Analysis leaderboard 🎉
vLLM tops @ArtificialAnlys on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source. How each result was built:
🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.
Every optimization is in vLLM main or on its way upstream. Huge thank you to @inferact, @digitalocean, @nvidia, @RedHat_AI, and the vLLM community 🙏 Full breakdown 👇
Great read from the @RedHat_AI team — a comprehensive investigation into TurboQuant in vLLM, with FP8 and BF16 as reference baselines: 4 models (30B to 200B+, decoder-only and MoE) and 5 benchmarks covering long-context retrieval and reasoning, all on the stable vLLM 0.20.2 release. If you're considering TurboQuant for your workload, this is the data to start from. 📝
TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings:
Michael Goin (@mgoin_) walks through @vllm_project v0.20.0. 752 commits. 320 contributors. 123 new. 🚀 🎉 DeepSeek V4, TurboQuant 2-bit KV cache, MXFP4 for MoE on Blackwell, FA4 as MLA prefill default, @PyTorch 2.11 + CUDA 13.0, Transformers V5, and a lot more. ~8 minutes.
🚀 vLLM-Omni v0.20.0 is out, aligned with upstream vLLM v0.20.0 (CUDA 13.0 · PyTorch 2.11 · Transformers 5.x).
⚡ Qwen3-Omni throughput +72% on H20, 32 conc (0.241 → 0.414 req/s) via talker / code2wav multi-replica scaling
🎙️ TTS faster & leaner: VoxCPM2 RTF 0.946 → 0.106 · Fish Speech Fast AR latency -53% · Qwen3-TTS / Voxtral-TTS Code2Wav saves ~3.2 GiB
🎨 Diffusion dynamic step-level batching: +7.8% throughput / -5.8% latency
🆕 New / improved: HunyuanImage-3.0, ERNIE T2I, AudioX, Wan2.2-S2V, LTX-2.3, FastGen Wan 2.1
📱 Wan2.2 on NPU production-ready: MindIE-SD, fused ops, VAE BF16, HSDP/USP, +50–60% perf
🧮 Quant expanded: Qwen Omni W4A16, OmniGen2 FP8, Z-Image FP8, HunyuanImage3 NPU, GLM-Image
🧩 Multi-backend updates across CUDA / ROCm / MUSA / NPU / XPU
Check it out →
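As a quick sanity check on the before/after figures quoted in the release post above, the relative changes can be recomputed from the raw numbers. This is only arithmetic on the values stated in the post; the helper name `pct_change` is an illustrative choice, and treating a lower RTF (real-time factor) as a proportional speedup is an assumption about how that metric is read.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Qwen3-Omni throughput in req/s, as quoted in the post.
qwen3_omni_gain = pct_change(0.241, 0.414)

# VoxCPM2 real-time factor, as quoted in the post (lower is assumed to be faster).
voxcpm2_rtf_change = pct_change(0.946, 0.106)
voxcpm2_speedup = 0.946 / 0.106

print(f"Qwen3-Omni throughput: {qwen3_omni_gain:+.1f}%")
print(f"VoxCPM2 RTF: {voxcpm2_rtf_change:+.1f}% (about {voxcpm2_speedup:.1f}x)")
```

The throughput figure rounds to the +72% stated in the post, and the RTF drop works out to a little under a 9x reduction.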
🎉 Day-0 vLLM support for Qwen3.6-27B! Congrats to @Alibaba_Qwen on the new 27B dense model release. Looking forward to more of the Qwen3.6 series. 👀 📖 Recipe: