Search vLLM on X — X Web Viewer

2026.05.11 21:33

vLLM tops the Artificial Analysis leaderboard 🎉
vLLM tops @ArtificialAnlys on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source.
How each result was built:
🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.
Every optimization is in vLLM main or on its way upstream.
Huge thank you to @inferact, @digitalocean, @nvidia, @RedHat_AI, and the vLLM community 🙏
Full breakdown 👇

vLLM@vllm_project

2026.05.08 14:00

🚀 vLLM-Omni v0.20.0 is out, aligned with upstream vLLM v0.20.0 (CUDA 13.0 · PyTorch 2.11 · Transformers 5.x).
⚡ Qwen3-Omni throughput +72% on H20, 32 conc (0.241 → 0.414 req/s) via talker / code2wav multi-replica scaling
🎙️ TTS faster & leaner: VoxCPM2 RTF 0.946 → 0.106 · Fish Speech Fast AR latency -53% · Qwen3-TTS / Voxtral-TTS Code2Wav saves ~3.2 GiB
🎨 Diffusion dynamic step-level batching: +7.8% throughput / -5.8% latency
🆕 New / improved: HunyuanImage-3.0, ERNIE T2I, AudioX, Wan2.2-S2V, LTX-2.3, FastGen Wan 2.1
📱 Wan2.2 on NPU production-ready: MindIE-SD, fused ops, VAE BF16, HSDP/USP, +50–60% perf
🧮 Quant expanded: Qwen Omni W4A16, OmniGen2 FP8, Z-Image FP8, HunyuanImage3 NPU, GLM-Image
🧩 Multi-backend updates across CUDA / ROCm / MUSA / NPU / XPU
Check it out →
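As a quick sanity check on the before/after figures quoted in the release note above, the relative changes can be recomputed from the raw numbers. This is only an arithmetic sketch: the helper names `percent_change` and `speedup` are illustrative and not part of vLLM-Omni, and reading the RTF drop as a "speedup" ratio is an interpretation rather than something the post states.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def speedup(before: float, after: float) -> float:
    """Ratio before/after, useful for lower-is-better metrics such as RTF."""
    return before / after


# Figures taken from the post: Qwen3-Omni throughput in req/s (higher is better).
qwen3_omni_gain = percent_change(0.241, 0.414)

# Figures taken from the post: VoxCPM2 real-time factor (lower is better).
voxcpm2_rtf_change = percent_change(0.946, 0.106)
voxcpm2_speedup = speedup(0.946, 0.106)

print(f"Qwen3-Omni throughput: {qwen3_omni_gain:+.1f}%")
print(f"VoxCPM2 RTF change:    {voxcpm2_rtf_change:+.1f}%")
print(f"VoxCPM2 RTF ratio:     {voxcpm2_speedup:.1f}x")
```

The throughput figure comes out at roughly +71.8%, consistent with the rounded +72% in the post.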

virushuo@virushuo

2026.04.26 15:02

weekend project: 2x3090/vllm cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 200k context. swival as my coding agent. As long as models keep getting more powerful via RL, distillation, and quantization, GPU depreciation will be much slower than expected. Even a 3090 will remain very useful
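For readers who want to try a similar setup, the sketch below assembles a `vllm serve` command line for the configuration the post describes. It rests on assumptions: the post does not give its launch flags, so mapping "2x3090" to `--tensor-parallel-size 2` and "200k context" to `--max-model-len 200000` is a guess, and `build_serve_command` is a hypothetical helper. The script only prints the command and does not start a server.

```python
import shlex


def build_serve_command(model: str, tensor_parallel_size: int, max_model_len: int) -> list[str]:
    """Assemble an argument list for `vllm serve` (hypothetical helper)."""
    return [
        "vllm", "serve", model,
        # Assumption: one tensor-parallel shard per GPU.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Assumption: "200k context" is read as 200,000 tokens.
        "--max-model-len", str(max_model_len),
    ]


cmd = build_serve_command(
    model="cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4",  # model name as written in the post
    tensor_parallel_size=2,
    max_model_len=200_000,
)

# Print a shell-ready command instead of executing it.
print(shlex.join(cmd))
```

Whether a context this long fits in 2x24 GB depends on KV-cache settings the post does not mention, so the values may need tuning.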

Tom Dörr@tom_doerr

2026.05.15 06:50

Real-time Qwen3-TTS without vLLM or Triton

Matt White@matthew_d_white

2026.05.16 16:03

I’ll be at #MLSys this week, May 18–22 🚀 PyTorch Foundation will have a booth with experts on PyTorch, vLLM, Ray + other foundation projects. Come by, ask questions, and meet the teams building open AI infra 🔥 I’m also speaking Monday morning on agentic self-improvement with OpenRoll 🤖 See you there 👋 #PyTorch #vLLM #Ray @PyTorch @vllm_project @raydistributed @linuxfoundation @aaif_io

PyTorch@PyTorch

2026.02.12 22:44

We’re excited to welcome Mooncake to the PyTorch Ecosystem! Mooncake is designed to solve the “memory wall” in LLM serving. By integrating Mooncake’s high-performance KVCache transfer and storage capabilities with PyTorch-native inference engines like SGLang, vLLM, and TensorRT-LLM, it unlocks new levels of throughput and scalability for large language model deployments. Mooncake enables prefill-decode disaggregation, global KVCache reuse, and elastic expert parallelism, and serves as a fault-tolerant PyTorch distributed backend. 🔗 #PyTorch #OpenSourceAI #LLM #AIInfrastructure

Yuchen Jin@Yuchenj_UW

2026.05.07 17:02

An OpenAI friend told me he burns 300M GPT-5.5 tokens/day. The top one in his team burns billions of tokens/day. Codex coding for them every night. Databricks also gives engineers unlimited tokens.
We're looking for cracked inference engineers to join us at Databricks AI to produce trillions of tokens, insanely fast. DM me if you have:
- Contributed to open-source ML systems like SGLang/vLLM/PyTorch
- Experience serving LLMs at large scale
Databricks AI runs like a startup. Lots of exciting things to build!
