This week's vLLM Office Hours:
@AMD on trends in AI agent applications. Every contribution ships upstream in vLLM main.
The primitives agentic inference needs are all in vLLM today:
🧠 Prefix caching: automatic KV reuse across agent turns, lower TTFT
🦅 EAGLE / P-EAGLE spec decode: draft proposals verified in a single pass
🛠️ Tool calling: parallel calls + guided decoding for schema-compliant outputs
Mooncake KV connector: distributed KV offload for long agentic traces
💾 CPU KV offload: throughput gains once KV cache outgrows GPU memory
🧭 vLLM Semantic Router: route requests across small vs large models (joint work with @AIatAMD)
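
For intuition on the prefix caching item, here is a toy sketch of block-level KV reuse across agent turns. It is not vLLM's code: `BLOCK_SIZE`, `block_hashes`, and `ToyPrefixCache` are hypothetical names made up for illustration, and the only assumption carried over is the general idea of identifying each full block of tokens by a hash chained to the blocks before it.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV block (toy value, an assumption for this sketch)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens.

    Each hash covers the block's tokens plus the previous block's hash,
    so two sequences share a hash only if their whole prefix matches.
    """
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Tracks which block hashes already have KV entries (hypothetical)."""

    def __init__(self):
        self.cached = set()

    def lookup_and_insert(self, tokens):
        """Return how many leading tokens hit the cache, then cache all full blocks."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self.cached:
                break  # first miss ends the reusable prefix
            hit_blocks += 1
        self.cached.update(hashes)
        return hit_blocks * BLOCK_SIZE


# Two agent turns that share a system prompt (token IDs are made up).
system_prompt = list(range(8))
turn1 = system_prompt + [100, 101, 102]
turn2 = turn1 + [200, 201, 202, 203, 204]

cache = ToyPrefixCache()
print(cache.lookup_and_insert(turn1))  # 0: cold cache
print(cache.lookup_and_insert(turn2))  # 8: the two system-prompt blocks are reused
```

Only full blocks are reusable in this sketch, so the three-token tail of the first turn is recomputed on the second turn. The reused prefix is the part that no longer needs prefill, which is where the lower TTFT comes from.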
full session