
fin
@fi56622380
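Positions don't matter; the operating logic and inner laws of things are what deserve more attention | Earned degrees in three different fields, lived everyday life on two continents, designed a Mars rover chip once, and still haven't seen the glaciers I long to visit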
People always ask: where's the next structural opportunity in AI chips? One of the structural shifts driven by the paradigm change over the past few months is heterogeneous AI inference, and SRAM-route startups led by Cerebras are right at the frontier of this new trend.

Every year, Nvidia's GTC conference introduces paradigm-shifting concepts in technology, setting the benchmark for the entire industry. After GTC, everyone scrambles to rewrite their roadmaps and copy the homework.

------------------------

To understand where SRAM-route companies fit in the ecosystem, you first need to look at the workload characteristics of the different stages of genAI inference. It breaks down into three parts:

Prefill: extremely high compute intensity, low demand on memory bandwidth, moderate-to-high demand on memory size.

Decode-stage attention: moderate compute intensity, extremely high demand on memory bandwidth (repeated reads/writes to KV cache), extremely high demand on memory size, because KV cache grows linearly as batch size increases.

Decode-stage FFN: moderate compute intensity, extremely high demand on memory bandwidth (repeated reads of model weights), moderate-to-high demand on memory size (model weights).

The characteristics of SRAM-route chips are equally clear: they push memory bandwidth to the absolute extreme, but everything else is a severe weakness. They fundamentally trade away compute intensity and memory-size scalability in exchange for the ultimate in memory bandwidth.

---------------------

Now let's look at SRAM's suitability across the three stages of AI inference:

Prefill: SRAM can't achieve high compute intensity because SRAM takes up too much die area, leaving limited space for compute units. So prefill is a weak point.

Decode-stage attention: SRAM can meet the high memory bandwidth requirement, but its memory size is too small to handle the batch size demands. So SRAM only satisfies half the requirements for attention.

Decode-stage FFN: SRAM can meet the high memory bandwidth requirement, and the memory size requirement is moderate. With optimized interconnect communication, SRAM chips can just about solve the memory size problem. The cost is steep, but the ROI can still pencil out in certain scenarios.

----------

So the applicable scope of SRAM-route accelerators in heterogeneous AI inference is crystal clear:

Prefill: forget about it. Garbage performance, garbage economics.

Decode-stage FFN: with enough effort and added cost, it's within reach.

Decode-stage attention: KV cache demands on memory size are too extreme, and the cost of batch processing is prohibitively high. Letting Cerebras's $2.3 million-per-chip, 45-chip, $100-million luxury system serve as an exclusive ultra-VIP service? Sure, that technically works. Imagine this: one or two users running agent flow on a coding task with 1–2M context length, and it takes the entire 44GB of SRAM on a single $2.3M Cerebras chip just for KV cache, otherwise the speed tanks. What kind of extravagant service is that?

-------------

So the conclusion couldn't be more obvious: if Cerebras tries to do full-stack AI inference on its own (prefill + decode ATTN + decode FFN), the economics simply don't work. There's no future in it.

The reason is that the cost of Cerebras is staggering. Even with gross margins squeezed razor-thin, the implied rental rate of each CS-3 system is still $41.96/hour, roughly ten times the rental of a B200. And that's for a single CS-3. You need to chain many of them together for LLM inference, so multiply that rental many times over.

This is precisely why the economics of the SRAM route are so poor. Nvidia already made this point crystal clear at GTC (see chart).

People hyping up SRAM as the future replacement for HBM? That's a pipe dream. With SRAM scaling already hitting a wall, SRAM density is nearly impossible to improve from one chip generation to the next. On the memory size dimension, HBM's exponential growth will only widen the gap with SRAM further. Even on the memory bandwidth dimension, HBM is growing exponentially too, narrowing the gap with SRAM.

So Nvidia's solution is elegant and clean: hand off the decode-stage FFN to the SRAM route, keep everything else on traditional HBM GPUs, and push the entire Pareto frontier significantly toward the upper right.

Rubin + LPX breaks through 1000 tokens/s at peak speed while still maintaining enough overall throughput to generate real commercial value (this is critically important). Remember: on Blackwell, if you wanted to hit 400–500 tokens/s, you could only process a tiny handful of concurrent requests. That's a massive waste of GPU resources. But now, even at 1000 tokens/s, you can still maintain a meaningful batch size (throughput), and it can finally generate commercial value.

The chart shows that at 400 tokens/s, Rubin + LPX delivers a 35x improvement in throughput. This is classic token economics: at that token speed, it represents a 35x improvement in commercial value over Blackwell.

---------------------

After GTC revealed this standard answer (and even earlier, after the Groq LPU acquisition), everyone has already started copying this homework on the heterogeneous inference front.

Google's TPU tapped Marvell for the SRAM component.
Amazon AWS's Trainium tapped Cerebras for the SRAM component.
ByteDance's AI ASIC tapped Qualcomm for the SRAM component.

We will absolutely see more announcements like these in the future.

And this is the best path to economic sustainability for Cerebras: stop trying to muscle through full-stack AI inference, focus on what you're good at, partner with mainstream AI ASICs on inference, and fight to embed your SRAM chips into other companies' decode FFN pipelines.

This is also why Cerebras's long-term trajectory hinges on how deeply it can integrate with AWS Trainium's disaggregated inference. If it's just what's been reported so far (Trainium handles prefill and Cerebras handles decode as a simple split), the technical implementation difficulty is much lower, but the economics still don't pencil out. It can only be a strategic positioning play: enough for a slice of the market, but not enough to generate a real competitive advantage at scale.

To follow the Nvidia playbook, they need deep integration of both companies' strengths. That will take real time and technical effort, with non-trivial difficulty, but the payoff would be worth it.

Solution one: Trainium handles prefill and decode attention, Cerebras handles decode FFN.
Solution two: Cerebras runs the draft model, Trainium runs verification.

Either solution delivers dramatically more market competitiveness.

--------------------

Does this kind of partnership with mainstream AI ASICs shrink the TAM for SRAM-route companies? No. This is the only long-term sustainable path for SRAM-route companies to grow their market.

Heterogeneous AI inference is unequivocally the future. Finding your piece of the puzzle early in this growing blueprint is the only way to grow alongside the market. The moment an SRAM-route company embeds itself into any mainstream AI ASIC's heterogeneous inference pipeline, its valuation will skyrocket, because the TAM in terms of unit shipments isn't even in the same order of magnitude.

Otherwise, heterogeneous AI inference will relentlessly erode the speed advantage of the SRAM route on the token speed dimension (not throughput). Full-stack inference on the SRAM route degenerating into an expensive toy is an inevitable outcome.
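Solution two above (Cerebras drafts, Trainium verifies) is speculative decoding. The following is a minimal sketch of why a fast drafter can pay off, under the common simplifying assumption that each drafted token is accepted independently with the same probability. The function names, the 0.8 acceptance rate, the draft length of 4, and the cost ratios are illustrative assumptions, not figures from Cerebras, AWS, or Nvidia.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability accept_rate, and that the verifier always contributes
    one token of its own (a correction or a bonus token).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be >= 0")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding on the verifier alone.

    draft_cost_ratio is the time for one drafted token divided by the time
    for one verifier pass (hypothetical; small when the drafter is fast).
    """
    # Cost of one speculative step, in units of one verifier pass.
    step_cost = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


if __name__ == "__main__":
    for ratio in (0.20, 0.05):
        speedup = estimated_speedup(0.8, 4, ratio)
        print(f"draft cost ratio {ratio:.2f}: ~{speedup:.2f}x")
```

Under this model, the cheaper each drafted token is relative to a verifier pass, the closer the speedup gets to the expected-tokens ceiling, which is the argument for placing the draft model on the low-latency SRAM part.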
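GTC 2026 retrospective: Nvidia has patched its weak spot, sharply eroding the biggest advantage of the AI accelerator startups: token speed

Looking back at my GTC preview, the directional calls and the technical roadmap held up without major problems. The solution Nvidia ended up with is more elegant than I had imagined: not only does prefill stay on the GPU, decode-stage attention stays on the GPU too (I didn't see that coming), and only the decode-stage MLP runs on the LPU.

This is similar in spirit to MatX's approach: weights live in SRAM, KV cache lives in HBM.

The benefit is that the attention stage needs a huge KV cache (easily tens to hundreds of GB), which LPU SRAM could never hold anyway, so putting that part in HBM is the more sensible choice.

It matches the trend in future agentic flows toward long multi-turn contexts and exploding long-context KV cache: even the huge KV cache produced by high batch concurrency can be held in HBM. As context length grows, all of the incremental cost lands on GPU HBM; the LPX side is completely static and unaffected, and depends only on the size of the model itself.

The LPU's precious 128GB of SRAM holds only the fixed weights of the FFN/MLP stage. The FFN stage accounts for more than 50% of the GPU's total decode time, and with short contexts it can exceed 60%, so getting a several-fold speedup on the FFN portion by running it on the LPU is a very good deal.

One possible drawback of this design: a typical transformer has many layers in its decode stage. Take 80 layers as an example: attention and FFN repeat 80 times, which means tensors have to pass back and forth between GPU and LPU 80 times. Even though the link in between is low-latency Nvidia Spectrum-X Ethernet, generating one token accumulates 80 GPU-LPU round-trip latencies, and that is a non-trivial cost.

With this new architecture, assuming attention and FFN take 40% and 60% of decode time respectively and the FFN stage is accelerated several-fold, overall speed at the peak can more than double (compared with Rubin NVL72).

Peak speed breaks through 1000 tokens/s while overall throughput still remains commercially meaningful. Remember: on Blackwell, if you wanted to hit 400–500 tokens/s, you could only process a tiny handful of concurrent requests, a massive waste of GPU resources. Now, even at 1000 tokens/s, you can still maintain a meaningful batch size (throughput), and it can finally generate commercial value.

The chart says that at 400 tokens/s, Rubin + LPX improves throughput by 35x. This is classic token economics: at that token speed, it represents a 35x improvement in commercial value over Blackwell.

-------

Now that Nvidia has patched this weak spot, what does it mean for the startups (for example Cerebras, d-Matrix, MatX, SambaNova)?

A startup's biggest selling point is a speed advantage, or a cost advantage, in specific scenarios.

In large-batch (many-request) scenarios, once the GPU's arithmetic intensity crosses the ridge point, utilization becomes very high, and the GPU has a significant advantage over the startups in both cost and speed.

So the main scenario in which these startups can survive is this: the customer's workload is concentrated in small batches and low latency, needs extreme speed, and is insensitive to very high cost. GPUs are extremely inefficient here and cannot reach the corresponding token speeds.

Cerebras: extreme speed. Its wafer-scale design carries a huge SRAM (40GB) and eliminates chip-to-chip communication, the biggest bottleneck, so the token rate is extremely high in small-batch scenarios with few users. But the cost is completely uncompetitive: one CS-3 system costs $2.3 million, far more than an equivalent GPU cluster. Against the H100, it trades more than ten times the cost for more than ten times the speed.

d-Matrix: high speed plus small-batch scenarios. In-memory compute reduces data movement and gives higher utilization than GPUs in small-batch decode, so perf/watt is reasonably competitive in this range. The recently introduced 3D stacked DRAM is meant to address the continued capacity/bandwidth scaling problem brought on by "larger reasoning models + higher token consumption".

SambaNova: in enterprise on-premises deployments that run multiple small and mid-sized models at once, GPU utilization suffers badly from context switching, and SambaNova's RDU offers better perf/dollar in this scenario. In essence this is still a cost advantage in a specific scenario; the general-purpose speed advantage is not that large.

MatX: a partitionable systolic array plus an SRAM/HBM hybrid, similar in thinking to Nvidia's heterogeneous architecture this time. The biggest highlight is that it puts weights in SRAM and KV cache in HBM within a single chip. Because it is a single chip, it avoids the 80 layers of LPU-GPU chip-to-chip communication that AFD (attention-FFN disaggregation) incurs, as mentioned earlier, so it still has some speed advantage, though its scalability may fall short of a GPU+LPU array.

In short, with Rubin + LPX, much of the former gap in the small-batch, low-latency, extreme-speed scenario has been filled, and the room for each startup's advantage keeps shrinking.

-------------

The preview mentioned speculative decoding, with the LPU running the draft model and the GPU doing verification, and predicted a very large speedup. That guess was right on target: the official blog gave it heavy emphasis, devoting an entire section to this usage: "LPX generates draft tokens rapidly using its low-latency architecture. Rubin GPUs verify and finalize tokens efficiently"

The other item from the preview, CPX (Content Phase aXcelerator, a compute module designed specifically for the compute-bound nature of prefill), seems to have vanished entirely from this GTC, without a single mention. Does that mean CPX has been cancelled outright?

Not necessarily, in my view.

Prefill and decode are currently disaggregated: one set of GPUs is dedicated to prefill and another to decode. Architecturally, having CPX replace the GPU for prefill is the more sensible choice and would speed up the prefill stage, though of course at higher cost, since it is an additional chip.

CPX does not conflict with Nvidia's current Rubin + LPX architecture; it would simply swap the GPUs doing prefill for CPX. So when there is a need for speed optimization later on, CPX may well come back.

-------------

The same reflection as in the previous post: every shift in computing paradigm brings a new wave of semiconductor startups. But as software and application forms gradually converge, the outcome is always that the big players use acquisitions to broaden and complete their feature sets, push specs higher, integrate the system more deeply and comprehensively, cut cost, and improve power and benchmark scores, until startups slowly lose the room to survive independently.

In the early mobile internet era, for example, contenders sprang up everywhere: small companies building application processors (AP), standalone baseband chips, ISPs, GPUs. But the eventual winners were all heterogeneous computing platforms that ended up pulling GPU, ISP, and modem into the SoC and completing system-level integration.

Apple acquiring PA Semi for CPUs and Infineon's modem, and hollowing out Imagination's GPU team; Qualcomm acquiring ATI's mobile GPU, Atheros for Wi-Fi, Nuvia for CPUs, and CSR for Bluetooth/DSP: these are all typical examples.

As heterogeneous inference grows more complex, companies that can do system-level integration will hold the advantage, exactly the same logic as in the mobile SoC era. In the AI era, Nvidia's attempted acquisition of Arm (which failed), its acquisition of Mellanox, and its acquisition of Groq are only the beginning of this new historical cycle.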
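The retrospective above estimates the overall speedup from a 40%/60% attention/FFN split and flags the 80 per-token GPU-LPU round trips as a cost. A rough Amdahl-style sketch of both effects follows. The 40/60 split and the 80 layers come from the post; the FFN speedup factors and the 2 microsecond round-trip latency are purely hypothetical placeholders, not measured Spectrum-X numbers.

```python
def overall_decode_speedup(ffn_fraction: float, ffn_speedup: float,
                           overhead_fraction: float = 0.0) -> float:
    """Amdahl-style estimate when only the FFN share of decode is accelerated.

    overhead_fraction is extra per-token time added by GPU-LPU transfers,
    expressed as a fraction of the original decode time (hypothetical knob).
    """
    if not 0.0 <= ffn_fraction <= 1.0:
        raise ValueError("ffn_fraction must be in [0, 1]")
    if ffn_speedup <= 0.0:
        raise ValueError("ffn_speedup must be positive")
    new_time = (1.0 - ffn_fraction) + ffn_fraction / ffn_speedup + overhead_fraction
    return 1.0 / new_time


def round_trip_seconds_per_token(num_layers: int, round_trip_s: float) -> float:
    """Accumulated GPU-LPU round-trip latency for one generated token."""
    return num_layers * round_trip_s


if __name__ == "__main__":
    for s in (3.0, 6.0, 10.0):
        print(f"FFN {s:.0f}x faster -> overall ~{overall_decode_speedup(0.6, s):.2f}x")

    # Assumed 2 microseconds per round trip; at 1000 tokens/s the budget is 1 ms.
    latency = round_trip_seconds_per_token(80, 2e-6)
    print(f"round trips: {latency * 1e6:.0f} us per token, "
          f"{latency / 1e-3:.0%} of a 1 ms token budget")
```

With a 60% FFN share, the ceiling is 1 / 0.4 = 2.5x no matter how fast the FFN gets, so any transfer overhead is subtracted from a fairly narrow margin.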
AI Semiconductor Endgame 2026 (Part 1)
New Token Economics Computing Paradigm Shifts from GPU Compute to HBM

This article starts from the essence of GPU architectural evolution to address a question the market has long worried about: why must each GPU's HBM memory demand grow exponentially, and why won't this exponential growth in HBM demand stall?

It then derives the first principle of token economics under the current architecture:

token throughput = HBM size × HBM BW (bandwidth)

It also discusses why the GPU ceiling is determined by HBM's two dimensions of progress.

The topic of HBM cyclicality has long been controversial. Optimists argue that AI-driven demand is much greater than before, but the market mainstream still believes that previous up-cycles also saw 20%+ annual demand growth, so what's different this time? AI doesn't change the fact that HBM, like traditional DRAM, has commodity attributes. Once capacity added at the demand peak meets a downturn, history will repeat itself.

We can take the perspective of compute-chip architecture, start from first principles, and reason through the question of why this time is genuinely different.

--------------------

History: The Era of CPU Compute

For a very long time, we lived in the era of CPU-dominated compute. The CPU's top-level KPI was performance (running faster), so each generation of CPUs deployed every method imaginable to push benchmark scores higher. First it was rising clock frequencies, then architectural evolution: superscalar designs, and so on.

During this period, why didn't DDR need to advance technologically at high speed? DDR3 to DDR5 took a full 15 years.

Because in this era, DDR's role was purely auxiliary, and only weakly so. By industry experience, even doubling DDR speed would generally raise CPU performance by less than 20%.

Why did improvements in DDR bandwidth and speed matter so little? Two reasons:

1. CPUs used all kinds of architectural tricks to hide DDR latency: superscalar designs, wider issue widths, massive ROBs and register renaming to extract parallelism and hide latency, L1 caches, L2 caches. All of these weakened the demand for DDR bandwidth and speed.

2. CPU workloads don't have particularly demanding bandwidth requirements. For most everyday workloads (say, opening a webpage), DDR bandwidth is severely overprovisioned. Even cloud workloads often look the same.

In other words, in the CPU era, DDR bandwidth and speed didn't really matter. There was virtually no difference between DDR4 and DDR5 except in a handful of games, and even the JEDEC standard advanced slowly.

On top of that, only a small portion of any given app needs to sit permanently in DDR. Whatever is needed can be paged in from the hard drive on demand. App size grew slowly, and so DDR capacity demand grew slowly as well. That's why, over the past decade, the average PC went from 7–8GB of DDR to about 23GB: only 3× growth in ten years.

This slow upgrade pace directly affected revenue. Capacity-based pricing was the main way of making money; speed improvements were just a technological upgrade that raised the unit price of capacity. With both of these dimensions advancing slowly, growth could only come from increases in PC/phone unit volumes.

So along both dimensions (bandwidth/speed and capacity), DRAM was always a "nice-to-have" appendage to the chip industry. The marginal utility of DDR upgrades was very low, and almost completely disconnected from the CPU era's top-level KPI.

--------------------

The Paradigm Shift: GenAI's Top-Level KPI

When we entered the era of GenAI large models, the computing paradigm shifted, and the top-level KPI changed fundamentally.

By the time GPUs evolved into AI inference engines, the top-level KPI was no longer compute alone (TOPS/FLOPS), as it had been for CPUs. It became the cost of a token. Specifically: overall token throughput per unit cost / per unit power.

A close second is token speed, because in the agent era many tasks have become serial, and token output speed has become a critical bottleneck for user experience.

This is exactly why Jensen invented the concept of the AI factory: to produce the most tokens at the lowest cost, while pushing token speed as high as possible.

In the AI training era, Jensen's economics were TCO (Total Cost of Ownership): the more GPUs you buy, the more you save.

In the inference era, Jensen's token economics flip the logic. AI inference has very healthy gross margins, so the logic now becomes: the NVIDIA GPU is the GPU that produces the cheapest token in the world, so the more you buy, the more you earn.

The top-level KPI has become a Pareto frontier: along the two dimensions of token throughput and token speed, optimize as far as possible. Each generation of NVIDIA's token factory is essentially pushing the entire Pareto frontier up and to the right. This is the most important KPI of the AI inference era.

--------------------

From Token Throughput to HBM: The Core Logic Chain

Below is the most important logical chain of this article: how to start from the exponential growth of token throughput and derive that the ceiling bottleneck lies in the exponential growth of HBM size and HBM speed.

In the era of single-GPU inference with batch size = 1, token throughput had only one dimension: HBM bandwidth. Higher bandwidth = higher token throughput.

But once we entered the NVL72 era, inference is no longer single-GPU. It is a system-level token factory composed of 72 GPUs + 36 CPUs, designed to fully saturate HBM bandwidth and compute simultaneously, in pursuit of the ultimate token throughput.

Token throughput growth depends on two things: the number of requests batched simultaneously × the average token speed per request. That is: batch size × token speed.
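As a small sketch of the batch size × token speed relation, the snippet below plugs in the Rubin NVL72 figures used in the example that follows (1,920 concurrent requests, 100 tokens/s each, roughly 120 kW per rack). The function names are illustrative only.

```python
def token_throughput(batch_size: int, tokens_per_s_per_request: float) -> float:
    """System throughput in tokens/s: concurrent requests x per-request speed."""
    return batch_size * tokens_per_s_per_request


def throughput_per_mw(throughput_tokens_per_s: float, power_kw: float) -> float:
    """Normalize throughput to tokens/s per megawatt of system power."""
    return throughput_tokens_per_s / (power_kw / 1000.0)


if __name__ == "__main__":
    total = token_throughput(batch_size=1920, tokens_per_s_per_request=100.0)
    per_mw = throughput_per_mw(total, power_kw=120.0)
    print(f"throughput: {total:,.0f} tokens/s")
    print(f"per MW:     {per_mw:,.0f} tokens/s")
```

Either factor can raise the product, which is why the Pareto frontier has two axes: a deployment can trade concurrent requests for per-request speed while the per-MW figure is what the operator ultimately pays for.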
Take Rubin NVL72 as an example. At an average token speed of 100 tokens/s, processing 1,920 simultaneous requests yields a token throughput of 192,000 tokens/s. A Rubin NVL72 draws roughly 120kW (0.12MW), so per MW it can handle 1.6M tokens/s.

So we need to find ways to push both parameters up: batch size and average token speed. Their product is our top-level KPI, token throughput.

Parameter 1: Batch growth (bottleneck is HBM size)

Every request in the batch carries its own KV cache, with sizes ranging from a few GB to tens of GB. Because hot KV cache must be read at high frequency and high speed at any moment, it must reside in HBM. For a model with, say, 80 layers, every token generation step requires reading the KV cache 80 times from HBM.

As batch size grows, hot KV cache grows linearly. And because the hot KV cache for every request in the batch must sit in HBM, HBM size must grow linearly with batch size.

Like an airport shuttle bus: the gate wants to move passengers to the plane as fast as possible. If HBM size is small, the shuttle is small, so you have to make extra trips.

Conclusion: batch size growth bottlenecks on HBM size growth.

Parameter 2: Average token speed per request (bottleneck is HBM bandwidth)

The decode-phase speed of a large model bottlenecks on HBM bandwidth, because every token generated requires reading the activated weights and KV cache many times over. The emergence of LPUs has, in cases where batch size isn't very large, moved the activated weights portion onto SRAM, but every generated token still requires many reads of the KV cache from HBM.

The higher the HBM bandwidth, the faster each token is generated, in essentially linear correspondence.

Like the airport shuttle bus: HBM bandwidth is like the width of the door. Wider doors mean passengers board faster.

The rest of the GPU's configuration is essentially adapted to support batch growth and to keep token compute speed in step with HBM growth. In some cases the GPU even spends excess compute to recover effective bandwidth (e.g., bandwidth compression techniques).

--------------------

To return to the shuttle bus analogy:

• Shuttle bus cabin size = HBM size (capacity): determines how many passengers can fit at once (i.e., how many requests' KV caches can sit in HBM simultaneously). Bigger cabin = more passengers (higher batch size) per trip. If the bus is too small, moving 100 people takes two trips, and total throughput suffers.

• Shuttle bus door width = HBM bandwidth: determines how fast passengers get on and off. With a wide door, everyone piles on at once (decode/token generation is fast). With a narrow door, even with a giant cabin, people queue up and most of the time is spent boarding.

• Passenger throughput = cabin size × door-width-determined boarding speed.

--------------------

At this point, we've logically derived the first principle of token-economics hardware demand:

Token throughput = HBM size × HBM bandwidth

The top-level KPI of the AI inference era is highly dependent on progress along both HBM dimensions. If we want to maintain 2× token throughput growth per generation, each generation of GPU must grow HBM size × HBM bandwidth by 2×!

This is the first time in history that HBM memory size can influence the top-level KPI, token throughput.

To validate this thesis, we can put NVIDIA's token throughput from A100 to Rubin Ultra on the same chart as HBM size × HBM bandwidth. What you find is that the two curves track each other startlingly closely on log axes.

HBM size × bandwidth actually grows even faster than token throughput, which makes sense: HBM defines the ceiling, and in practice utilization of that ceiling is very hard to push to 100%. Even if HBM size × HBM bandwidth grew by 1,000×, with the supporting compute and architecture, it would be very hard to wring out the full 1,000× of headroom.

This curve isn't a coincidence. It's the necessary solution of system optimization: throughput = batch × speed. This is the unavoidable first principle of token factory economics.

--------------------

What about software? Won't software optimization reduce bandwidth demand? Reduce HBM demand?

This is an independent dimension from hardware. It's like asking: if software on a CPU runs faster after optimization, does that mean the CPU doesn't need to advance for ten years? After all, software is faster now. If that were the case, would CPU vendors still make money?

For a CPU vendor to survive, there's only one path: in standardized benchmarks, ignoring software optimization, every new CPU generation must score higher. Otherwise it doesn't sell.

GPUs are exactly the same. How well software is optimized, and the requirement that the GPU's own token-throughput KPI must improve dramatically every year, are two separate things.

As long as token demand keeps growing, the pursuit of higher token throughput will not stop, and so neither will the pursuit of higher HBM size × HBM bandwidth.

If HBM size and HBM speed were to slow down, Jensen would personally fly to the Big Three and pressure them to accelerate, because that is his GPU ceiling. If the ceiling stops rising, can his GPU still sell?

Of course, NVIDIA also needs to wrack its brains to extract performance beyond the HBM ceiling through heterogeneous architectural angles. The LPU is a great example: it improved the Pareto frontier substantially from a different angle (the right-hand, high-token-speed portion).

--------------------

HBM memory has now bid farewell to the old era of drifting with the tide. On this one-way road paved by exponential demand, it has, in something close to a destined fashion, walked onto the central stage of the industry's epic.

When the inference paradigm's first principles evolve to this point, as long as Jensen still wants to sell GPUs, HBM must double, and it must double every generation.

This is endogenous pressure from the supply side. It has nothing to do with AI demand, nothing to do with macro cycles, and nothing to do with the moods of the hyperscalers.

The only remaining question is this: when demand has been physically locked into exponential growth, will the three players on the supply side, as they have for the past thirty years, once again drag themselves back into the mire of the cycle by their own hands?
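The "Parameter 1" argument above (batch size is capped by HBM size) can be made concrete with a back-of-the-envelope KV cache calculation. This is only a sketch: the model shape (80 layers, 8 KV heads, head dimension 128, 2-byte elements), the 32,768-token context, the 70 GB of weights, and the 192 GB / 384 GB HBM capacities are all hypothetical values chosen for illustration, and the estimate ignores activations, fragmentation, and sharding across GPUs.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes added per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_batch_in_hbm(hbm_bytes: int, weight_bytes: int,
                     kv_bytes_per_request: int) -> int:
    """How many requests' KV caches fit in the HBM left over after weights."""
    free_bytes = hbm_bytes - weight_bytes
    if free_bytes <= 0 or kv_bytes_per_request <= 0:
        return 0
    return free_bytes // kv_bytes_per_request


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                                   head_dim=128, bytes_per_elem=2)
    per_request = per_token * 32_768  # hypothetical context length
    print(f"KV cache per request: {per_request / 1e9:.1f} GB")

    weights = 70 * 10**9  # hypothetical weight footprint
    for hbm_gb in (192, 384):
        batch = max_batch_in_hbm(hbm_gb * 10**9, weights, per_request)
        print(f"{hbm_gb} GB HBM -> up to {batch} concurrent requests")
```

In this toy setup the KV cache per request lands in the "a few GB to tens of GB" range the article describes, and because the weight footprint is fixed, added HBM capacity goes entirely toward a larger batch.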