Great work!
Tested on my 4x RTX Pro 6000 (Workstation Edition, power-limited to 300 W each) over PCIe 4.0:
tp=2, pp=2: prefill 1570 tok/s, decode 34 tok/s
tp=4, pp=1: prefill 967 tok/s, decode 49 tok/s
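
For anyone comparing the two layouts, here is a rough sketch of the tradeoff in the numbers above. It is not part of the original benchmark. It assumes both figures are tokens per second, and the `runs` dict and `ratio` helper are names made up for illustration.

```python
# Throughput figures copied from the two runs above (assumed to be tokens/s).
runs = {
    "tp=2, pp=2": {"prefill": 1570, "decode": 34},
    "tp=4, pp=1": {"prefill": 967, "decode": 49},
}


def ratio(metric, a="tp=4, pp=1", b="tp=2, pp=2"):
    """Return how many times faster run `a` is than run `b` on `metric`."""
    return runs[a][metric] / runs[b][metric]


decode_ratio = ratio("decode")
prefill_ratio = ratio("prefill")

print(f"decode:  tp=4 is {decode_ratio:.2f}x of tp=2/pp=2")
print(f"prefill: tp=4 is {prefill_ratio:.2f}x of tp=2/pp=2")
```

So on this setup tp=4 decodes about 1.44x as fast but prefills at about 0.62x the rate, which suggests picking the layout by whether the workload is dominated by long prompts or long generations.
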
My Dockerfile:
GLM-5.1-478B-NVFP4
Running on:
- 4x RTX Pro 6000
- SGLang
- 370,000 max tokens (1.75x full context)
- p10 27.7 | p90 45.6 tok/s decode (gen)
- 1340 tok/s prefill
I could get about 2x the decode speed (100 tok/s) by limiting context to 64k.
In this video it operates Figma (:
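
A quick back-of-the-envelope check of the figures quoted above, as a sketch only. It assumes "1.75x full context" means the 370,000-token capacity divided by the model's full context length, and the variable names are made up for illustration.

```python
# Numbers copied from the post above.
max_tokens = 370_000        # token capacity quoted in the post
context_multiple = 1.75     # "1.75x full context"
decode_p10, decode_p90 = 27.7, 45.6   # tok/s at the large-capacity setting
decode_64k = 100            # tok/s when context is limited to 64k

# Assumption: capacity = context_multiple * full context length.
implied_full_context = max_tokens / context_multiple

# How much faster the 64k setting decodes relative to each percentile.
speedup_vs_p90 = decode_64k / decode_p90
speedup_vs_p10 = decode_64k / decode_p10

print(f"implied full context: ~{implied_full_context:,.0f} tokens")
print(f"64k speedup vs p90: {speedup_vs_p90:.2f}x")
print(f"64k speedup vs p10: {speedup_vs_p10:.2f}x")
```

Under that reading, the "2x" figure lines up with the p90 decode rate (about 2.19x), and the gain over the p10 rate is larger (about 3.61x).
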