GLM-5.1-478B-NVFP4
Running on:
- 4x RTX Pro 6000
- SGLang
- 370,000 max tokens (1.75x full context)
- p10 27.7 | p90 45.6 tok/s decode (gen)
- 1340 tok/s prefill
I could get ~2x decode (100 tok/s) if I limit context to 64k
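
Quick sanity check on the numbers above, as a rough sketch. Assumptions that are not in the post: "64k" is taken as 64,000 tokens, the prefill rate is treated as constant regardless of prompt length, and the helper names are just made up for illustration.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the post.
MAX_TOKENS = 370_000       # max tokens from the post
CONTEXT_MULTIPLE = 1.75    # "1.75x full context"
DECODE_P10 = 27.7          # tok/s decode, p10
DECODE_P90 = 45.6          # tok/s decode, p90
DECODE_64K = 100.0         # tok/s decode when limited to 64k context
PREFILL = 1340.0           # tok/s prefill
CONTEXT_64K = 64_000       # assumption: "64k" read as 64,000 tokens


def implied_full_context(max_tokens: float, multiple: float) -> float:
    """Full context length implied by 'max tokens = multiple x full context'."""
    return max_tokens / multiple


def speedup(new_rate: float, old_rate: float) -> float:
    """Ratio of the new decode rate to the old one."""
    return new_rate / old_rate


def prefill_seconds(prompt_tokens: float, tok_per_s: float) -> float:
    """Time to prefill a prompt, assuming a constant prefill rate."""
    return prompt_tokens / tok_per_s


if __name__ == "__main__":
    print(f"implied full context: {implied_full_context(MAX_TOKENS, CONTEXT_MULTIPLE):,.0f} tokens")
    print(f"64k decode speedup vs p90: {speedup(DECODE_64K, DECODE_P90):.2f}x")
    print(f"64k decode speedup vs p10: {speedup(DECODE_64K, DECODE_P10):.2f}x")
    print(f"prefill time for a full 64k prompt: {prefill_seconds(CONTEXT_64K, PREFILL):.1f} s")
```

So the 370,000-token budget implies a full context of roughly 211k tokens, and 100 tok/s is about 2.2x the p90 decode rate (about 3.6x the p10), which lines up with the "2x" figure.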
In this video it operates Figma (:
GLM-5.1 shot my Z.ai usage up so much, it's such a good model. Huge leap for them, even just between this and GLM-5
It can run for hours without needing a nudge. First open model IMO that hits this