
Search results for vLLM
Tweets including vLLM
vLLM tops the Artificial Analysis leaderboard 🎉
vLLM tops @ArtificialAnlys on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source.
How each result was built:
🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.
Every optimization is in vLLM main or on its way upstream.
Huge thank you to @inferact, @digitalocean, @nvidia, @RedHat_AI, and the vLLM community 🙏
Full breakdown 👇
🚀 vLLM-Omni v0.20.0 is out, aligned with upstream vLLM v0.20.0 (CUDA 13.0 · PyTorch 2.11 · Transformers 5.x).
⚡ Qwen3-Omni throughput +72% on H20, 32 conc (0.241 → 0.414 req/s) via talker / code2wav multi-replica scaling
🎙️ TTS faster & leaner: VoxCPM2 RTF 0.946 → 0.106 · Fish Speech Fast AR latency -53% · Qwen3-TTS / Voxtral-TTS Code2Wav saves ~3.2 GiB
🎨 Diffusion dynamic step-level batching: +7.8% throughput / -5.8% latency
🆕 New / improved: HunyuanImage-3.0, ERNIE T2I, AudioX, Wan2.2-S2V, LTX-2.3, FastGen Wan 2.1
📱 Wan2.2 on NPU production-ready: MindIE-SD, fused ops, VAE BF16, HSDP/USP, +50–60% perf
🧮 Quant expanded: Qwen Omni W4A16, OmniGen2 FP8, Z-Image FP8, HunyuanImage3 NPU, GLM-Image
🧩 Multi-backend updates across CUDA / ROCm / MUSA / NPU / XPU
Check it out →
weekend project: 2x3090/vllm cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 200k context. swival as my coding agent. As long as models keep getting more powerful via RL, distillation, and quantization, GPU depreciation will be much slower than expected. Even a 3090 will remain very useful
Real-time Qwen3-TTS without vLLM or Triton
I’ll be at #MLSys this week, May 18–22 🚀 PyTorch Foundation will have a booth with experts on PyTorch, vLLM, Ray + other foundation projects. Come by, ask questions, and meet the teams building open AI infra 🔥 I’m also speaking Monday morning on agentic self-improvement with OpenRoll 🤖 See you there 👋 #PyTorch #vLLM #Ray @PyTorch @vllm_project @raydistributed @linuxfoundation @aaif_io
We’re excited to welcome Mooncake to the PyTorch Ecosystem! Mooncake is designed to solve the “memory wall” in LLM serving. By integrating Mooncake’s high performance KVCache transfer and storage capabilities with PyTorch native inference engines like SGLang, vLLM, and TensorRT-LLM, it unlocks new levels of throughput and scalability for large language model deployments. Mooncake enables prefill decode disaggregation, global KVCache reuse, elastic expert parallelism, and serves as a fault tolerant PyTorch distributed backend. 🔗 #PyTorch #OpenSourceAI #LLM #AIInfrastructure
An OpenAI friend told me he burns 300M GPT-5.5 tokens/day. The top one in his team burns billions of tokens/day. Codex coding for them every night. Databricks also gives engineers unlimited tokens.
We're looking for cracked inference engineers to join us at Databricks AI to produce trillions of tokens, insanely fast. DM me if you have:
- Contributed to open-source ML systems like SGLang/vLLM/PyTorch
- Experience serving LLMs at large scale
Databricks AI runs like a startup. Lots of exciting things to build!