GLM-5.1-478B-NVFP4
Running on:
- 4x RTX Pro 6000
- SGLang
- 370,000 max tokens (1.75x full context)
- p10 27.7 | p90 45.6 tok/s decode (gen)
- 1340 tok/s prefill
I could get roughly 2x decode (100 tok/s) by limiting context to 64k
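For a rough sense of what these rates mean per request, here is a back-of-envelope sketch. It only uses the prefill and decode figures quoted above. The function name `estimate_seconds` and the example request size are made up for illustration, and it assumes both rates stay constant for the whole request, which real decode speed usually does not (it tends to drop as context grows).

```python
# Throughput figures quoted in the post (tokens per second).
PREFILL_TOK_S = 1340.0
DECODE_P10_TOK_S = 27.7   # slower end of observed decode speed
DECODE_P90_TOK_S = 45.6   # faster end of observed decode speed


def estimate_seconds(prompt_tokens: int, output_tokens: int, decode_tok_s: float) -> float:
    """Rough wall-clock estimate: prefill time plus decode time.

    Assumption: both rates are constant over the request. This is a
    simplification, not a model of how the server actually behaves.
    """
    return prompt_tokens / PREFILL_TOK_S + output_tokens / decode_tok_s


if __name__ == "__main__":
    # Hypothetical request: 64k-token prompt, 1k-token reply.
    prompt, output = 64_000, 1_000
    slow = estimate_seconds(prompt, output, DECODE_P10_TOK_S)
    fast = estimate_seconds(prompt, output, DECODE_P90_TOK_S)
    print(f"{prompt} prompt + {output} output tokens: {fast:.0f}s to {slow:.0f}s")
```

With a long prompt, prefill dominates the total, so the decode speedup from capping context matters most for short prompts with long outputs.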
In this video it operates Figma (: