This week's vLLM Office Hours:
@AMD on trends in AI agent applications. Every contribution ships upstream in vLLM main.
The primitives agentic inference needs are all in vLLM today:
🧠 Prefix caching — automatic KV reuse across agent turns, lower TTFT
🦅 EAGLE / P-EAGLE spec decode — draft proposals verified in a single pass
🛠️ Tool calling — parallel calls + guided decoding for schema-compliant outputs
🌙 Mooncake KV connector — distributed KV offload for long agentic traces
💾 CPU KV offload — throughput gains once KV cache outgrows GPU memory
🧭 vLLM Semantic Router — route requests across small vs large models (joint work with @AIatAMD)
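A toy sketch of why prefix caching lowers TTFT across agent turns. This is plain Python, not vLLM's implementation: `ToyPrefixCache`, `BLOCK_SIZE`, and `prefill` are hypothetical names, and the cache stores token prefixes instead of real KV tensors.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cached block (hypothetical; kept tiny for the demo)


@dataclass
class ToyPrefixCache:
    """Toy stand-in for a KV cache keyed by full-block token prefixes."""

    blocks: set = field(default_factory=set)

    def prefill(self, tokens: list[int]) -> tuple[int, int]:
        """Return (reused_tokens, computed_tokens) for one request."""
        reused = 0
        still_hitting = True
        # Only whole blocks are cacheable; a trailing partial block is recomputed.
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # the key covers the entire prefix so far
            if still_hitting and key in self.blocks:
                reused = end
            else:
                # First miss ends reuse; later blocks are computed and stored.
                still_hitting = False
                self.blocks.add(key)
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
turn1 = list(range(10))                        # system prompt + first message
turn2 = turn1 + [100, 101, 102, 103, 104, 105]  # same history + new tool output
first = cache.prefill(turn1)
second = cache.prefill(turn2)
print(first, second)
```

In this sketch the second turn reuses the two full blocks it shares with the first turn, so only the new tail needs prefill work.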
Full session 👇