Register and share your invite link to earn from video plays and referrals.

vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join to discuss together with the community!
Joined March 2024
36 Following    38.6K Followers
Great read from the @RedHat_AI team โ€” a comprehensive investigation into TurboQuant in vLLM, with FP8 and BF16 as reference baselines: 4 models (30B to 200B+, decoder-only and MoE) and 5 benchmarks covering long-context retrieval and reasoning, all on the stable vLLM 0.20.2 release. If you're considering TurboQuant for your workload, this is the data to start from. ๐Ÿ“
Show more
TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings:
Show more