NVIDIA AI
@NVIDIAAI
Teaching your AI new tricks.
Joined June 2016
855 Following    294.8K Followers
What if every decode step gave the next one a head start? Meet Guess-Verify-Refine, a new hardware-aware sparse-attention algorithm from NVIDIA Research. Built for TensorRT LLM on Blackwell, it reuses temporal patterns across decode steps for:
→ 1.88x faster Top-K attention
→ 9.3% better end-to-end latency in low-latency serving

Dive into the paper:
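The post only outlines the idea of letting one decode step's attention selection give the next step a head start. The following minimal Python sketch illustrates that general idea under a hypothetical warm-start scheme. The previous step's Top-K key indices (`prev_indices`) act as the guess. A threshold derived from them is verified in one pass over the current scores, and the exact Top-K is then refined from the surviving candidates. The names `warm_start_topk` and `full_topk` are invented for this example. This is not the Guess-Verify-Refine algorithm from the paper, and the post does not describe that algorithm's details.

```python
from typing import List, Sequence


def full_topk(scores: Sequence[float], k: int) -> List[int]:
    """Reference Top-K: indices of the k largest scores, ties broken by lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def warm_start_topk(scores: Sequence[float], prev_indices: Sequence[int], k: int) -> List[int]:
    """Top-K selection warm-started from the previous decode step's selection.

    Hypothetical illustration of a guess / verify / refine pattern,
    not the method from the NVIDIA paper.
    """
    assert len(set(prev_indices)) == k, "the guess must name k distinct positions"

    # Guess: assume last step's Top-K set is still roughly right. Its smallest
    # current score is a safe threshold, because the minimum of any k scores
    # can never exceed the true k-th largest score.
    threshold = min(scores[i] for i in prev_indices)

    # Verify: one linear pass keeps every position at or above the threshold.
    # This candidate list is guaranteed to contain the exact Top-K.
    candidates = [i for i, s in enumerate(scores) if s >= threshold]

    # Refine: sort only the candidates (same key as full_topk, so ties agree).
    # If attention drifts little between steps, this list is close to k long.
    candidates.sort(key=lambda i: (-scores[i], i))
    return sorted(candidates[:k])
```

For example, `warm_start_topk([0.1, 0.9, 0.3, 0.8, 0.05, 0.7], [1, 3, 2], 3)` derives the threshold 0.3 from the guess and keeps 4 of the 6 positions as candidates. It returns `[1, 3, 5]`, which matches `full_topk`. In this sketch the verify pass still reads every score. The only saving is that the final sort runs over the short candidate list instead of the whole row.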