vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join to discuss together with the community!
Joined March 2024
36 Following    38.6K Followers
Great work at @baseten running vLLM-Omni in production: open-source, production-grade, cost-efficient omni-modal serving 🎙️ Multi-stage audio, streaming multi-modal, real-time TTS: workloads where closed-source APIs have been the default. →
๋” ๋ณด๊ธฐ
We serve Qwen3-TTS on vLLM-Omni at $3 per 1M characters. That's 90% lower in cost than comparable closed-source TTS APIs. Our engineers optimized a single-replica serving stack to get there. Details on the optimized stack and cost per concurrent stream here.
๋” ๋ณด๊ธฐ