๊ฐ€์ž… ํ›„ ์ดˆ๋Œ€ ๋งํฌ๋ฅผ ๊ณต์œ ํ•˜๋ฉด ๋™์˜์ƒ ์žฌ์ƒ ๋ฐ ์ดˆ๋Œ€ ๋ณด์ƒ์„ ๋ฐ›์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join to discuss together with the community!
๊ฐ€์ž… March 2024
36 ํŒ”๋กœ์ž‰ ์ค‘    38.6K ํŒฌ
Great read from the @RedHat_AI team โ€” a comprehensive investigation into TurboQuant in vLLM, with FP8 and BF16 as reference baselines: 4 models (30B to 200B+, decoder-only and MoE) and 5 benchmarks covering long-context retrieval and reasoning, all on the stable vLLM 0.20.2 release. If you're considering TurboQuant for your workload, this is the data to start from. ๐Ÿ“
๋” ๋ณด๊ธฐ
TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings:
๋” ๋ณด๊ธฐ