vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join to discuss together with the community!
Joined March 2024
36 Following    38.6K Followers
Great read from the @RedHat_AI team — a comprehensive investigation into TurboQuant in vLLM, with FP8 and BF16 as reference baselines: 4 models (30B to 200B+, decoder-only and MoE) and 5 benchmarks covering long-context retrieval and reasoning, all on the stable vLLM 0.20.2 release. If you're considering TurboQuant for your workload, this is the data to start from. 📝
TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings: