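BlanPlan (@blanplan): @iamai_omni This matches what anyone who has actually run inference feels. When serving large models in production, the GPU's compute units often sit 60-70% idle. The bottleneck is moving the KV cache between cards, plus the waits to fetch weights in the middle of computing attention. Doubling compute raises throughput only a little, while optimizing NVLink bandwidth and the KV cache together cuts latency in half.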
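The arithmetic behind that claim can be sketched with a rough Amdahl-style model. This is only a back-of-the-envelope illustration, not a measurement: it assumes the 60-70% idle time is entirely data movement (KV cache transfers and weight fetches) with no overlap with compute, and the function names `speedup` and `movement_speedup_to_halve_latency` are made up for this example.

```python
def speedup(idle_fraction, compute_speedup=1.0, movement_speedup=1.0):
    """Amdahl-style estimate of overall speedup for one inference step.

    Assumption (not from the post): step time splits cleanly into a compute
    share and a data-movement share equal to the idle fraction, no overlap.
    """
    compute = 1.0 - idle_fraction   # share of the step spent computing
    movement = idle_fraction        # share spent waiting on KV cache / weights
    new_time = compute / compute_speedup + movement / movement_speedup
    return 1.0 / new_time


def movement_speedup_to_halve_latency(idle_fraction):
    """How much faster data movement must get for step time to drop to 0.5."""
    compute = 1.0 - idle_fraction
    if compute >= 0.5:
        raise ValueError("compute alone already takes half the step or more")
    return idle_fraction / (0.5 - compute)


if __name__ == "__main__":
    for idle in (0.60, 0.70):
        print(
            f"idle={idle:.0%}: "
            f"2x compute -> {speedup(idle, compute_speedup=2):.2f}x, "
            f"2x data movement -> {speedup(idle, movement_speedup=2):.2f}x, "
            f"halving latency needs "
            f"{movement_speedup_to_halve_latency(idle):.1f}x faster data movement"
        )
```

Under these assumptions, doubling compute buys roughly 1.2x, while the same factor applied to data movement buys roughly 1.4-1.5x. Halving latency would need data movement to get several times faster, which is consistent with the post's point that bandwidth and KV cache handling have to be optimized together.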

BlanPlan

@blanplan

CTO｜ex-Baidu｜AI, product, engineering & startup — real frontline notes



2026.05.16 04:37
