Sudo su (@sudoingX)

2026.05.17 06:23

people keep asking what engine i use. no lm studio. no ollama. i compile llama.cpp from source every time for personal inference. no abstraction layers. if you're serious about local inference, start at source level. it's a no-brainer. here's why.

when you compile from source you control everything. which cuda arch to target. which quantization kernels to enable. flash attention flags. context size limits. you're not waiting for some gui app to update when the gguf format changes or a new quant drops. you pull, you build, you run. minutes, not days.

lm studio and ollama are fine for trying things. but the moment you need custom context lengths, specific kv cache configs, or hardware-specific optimizations like the GB10 tensor cores on my spark, those abstractions become walls.

compiling from source means when something breaks you know exactly where. when something is slow you know exactly why. there's no black box between you and the metal. that's the difference between using local ai and understanding local ai.
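for anyone who wants to see what "pull, build, run" can look like, here is a minimal sketch, not the exact setup from this post. it only assembles and prints the commands, it doesn't run them. the helper name `llama_cpp_commands`, the `model.gguf` path, arch `86` (an RTX 3090), the 32768 context and the `q8_0` kv cache type are all illustrative assumptions. swap in your own GPU's compute capability and check the flags against the llama.cpp version you actually pulled.

```python
import shlex


def llama_cpp_commands(cuda_arch="86", ctx=32768, kv_type="q8_0",
                       model="model.gguf", jobs=8):
    """Return [configure, build, run] argv lists for a CUDA build of llama.cpp.

    Hypothetical helper: every default here is an assumption for illustration.
    """
    # Configure: enable the CUDA backend and target one GPU architecture.
    configure = [
        "cmake", "-B", "build",
        "-DGGML_CUDA=ON",
        f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch}",
        # Compile FlashAttention kernels for all KV cache quant combinations.
        "-DGGML_CUDA_FA_ALL_QUANTS=ON",
    ]
    # Build in Release mode with a fixed number of parallel jobs.
    build = ["cmake", "--build", "build", "--config", "Release", "-j", str(jobs)]
    # Run: explicit context length, full GPU offload, quantized KV cache.
    # The flash attention flag is left out because its spelling has changed
    # between llama.cpp versions.
    run = [
        "./build/bin/llama-server",
        "-m", model,
        "-c", str(ctx),
        "-ngl", "99",
        "--cache-type-k", kv_type,
        "--cache-type-v", kv_type,
    ]
    return [configure, build, run]


if __name__ == "__main__":
    for argv in llama_cpp_commands():
        print(shlex.join(argv))
```

run it from inside a llama.cpp checkout after a `git pull` and paste the printed lines into your shell. every knob the post talks about (arch, kernels, context, kv cache) is an explicit argument you can read and change.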

Sudo su (@sudoingX)

2026.05.17 05:10

i've run a stack of models across a single 3090, a 5090, and a 128GB DGX Spark. exactly three are worth building on. the honest list.

the three worth it:

> 1. StepFun Step-3.5 Flash, the REAP pruned 121B MoE (Q6, DGX Spark)
a 121 billion parameter mixture of experts running on a single desktop box. the most worth-it model in everything i've tested.

> 2. Qwen 3.6 27B Dense, Q4 (single RTX 3090)
the undisputed king of the 24GB tier. one shot a playable game, around 41 tok/s, fits with context headroom to spare. one 24GB card, this is your answer.

> 3. NVIDIA Nemotron 3 Nano Omni, 30B-A3B (DGX Spark)
the best multimodal i've tested for video classification work. vision in, runs clean on the Spark.

the rest, ran them, they hold up fine:

on the Spark: DeepSeek V4 Flash 158B, GLM 4.7 Flash, GLM 4.5 Air REAP 82B-A12B, Gemma 4 26B-A4B, Qwen3-VL 235B-A22B, Qwen3 Coder 30B-A3B, Qwen3 30B-A3B, Carnice 35B-A3B.

on consumer GPUs: Kimi K2.5 1T, Qwen3-Coder-Next 80B, Hermes 4.3 36B, Qwen 3.5 27B Dense.

single 3090 to a 128GB Spark, that's the range. the three up top are the ones worth your hardware today.
