fin(@fi56622380): People always ask: where's the next structural opportunity in AI chips?

One of the structural shifts driven by the paradigm change over the past few months is heterogeneous AI inference, and the SRAM-route startups led by Cerebras are right at the frontier of this new trend.

Every year, Nvidia's GTC conference introduces paradigm-shifting concepts in technology, setting the benchmark for the entire industry. Everyone scrambles to rewrite their roadmaps and copy the homework after GTC.

------------------------

To understand where SRAM-route companies fit in the ecosystem, you first need to look at the workload characteristics of the different stages of genAI inference. It breaks down into three parts:

Prefill: extremely high compute intensity, low demand on memory bandwidth, moderate-to-high demand on memory size.

Decode-stage attention: moderate compute intensity, extremely high demand on memory bandwidth (repeated reads/writes to KV cache), extremely high demand on memory size, because KV cache grows linearly as batch size increases.

Decode-stage FFN: moderate compute intensity, extremely high demand on memory bandwidth (repeated reads of model weights), moderate-to-high demand on memory size (model weights).

The characteristics of SRAM-route chips are equally clear: they push memory bandwidth to the absolute extreme, but everything else is a severe weakness. They fundamentally trade away compute intensity and the ability to scale memory size in exchange for the ultimate in memory bandwidth.

---------------------

Now let's look at SRAM's suitability across the three stages of AI inference:

Prefill: SRAM can't achieve high compute intensity because SRAM takes up too much die area, leaving limited space for compute units. So prefill is a weak point.

Decode-stage attention: SRAM can meet the high memory bandwidth requirement, but its memory size is too small to handle the batch size demands. So SRAM only satisfies half the requirements for attention.
Decode-stage FFN: SRAM can meet the high memory bandwidth requirement, and the memory size requirement is moderate. Through optimized interconnect communication, SRAM chips can barely solve the memory size problem: the cost is steep, but the ROI can still pencil out in certain scenarios.

----------

So the applicable scope of SRAM-route accelerators in heterogeneous AI inference is crystal clear:

Prefill: forget about it. Garbage performance, garbage economics.

Decode-stage FFN: with enough effort and added cost, it's within reach.

Decode-stage attention: KV cache demands on memory size are too extreme, and the cost of batch processing is prohibitively high. Letting Cerebras's $2.3 million-per-chip, 45-chip, $100-million luxury system serve as an exclusive ultra-VIP service? Sure, that technically works.

Imagine this: one or two users running an agent flow on a coding task with 1–2M context length, and it takes the entire 44GB of SRAM on a single $2.3M Cerebras chip just for KV cache, otherwise the speed tanks. What kind of extravagant service is that?

-------------

So the conclusion couldn't be more obvious: if Cerebras tries to do full-stack AI inference on its own (prefill + decode ATTN + decode FFN), the economics simply don't work. There's no future in it, because the cost of Cerebras is staggering. Even with their gross margins squeezed razor-thin, the implied rental rate of each CS-3 system is still $41.96/hour, roughly ten times the rental of a B200. And that's for a single CS-3. You need to chain many of them together for LLM inference, so multiply that rental by a lot more.

This is precisely why the economics of the SRAM route are so poor. Nvidia already made this point crystal clear at GTC (see chart).

People hyping up SRAM as the future replacement for HBM? That's a pipe dream. With SRAM scaling already hitting a wall, SRAM density is nearly impossible to improve from one chip generation to the next.
On the memory size dimension, HBM's exponential growth will only widen the gap with SRAM further. Even on the memory bandwidth dimension, HBM is growing exponentially too, narrowing the gap with SRAM.

So Nvidia's solution is elegant and clean: hand off the decode-stage FFN to the SRAM route, keep everything else on traditional HBM GPUs, and push the entire Pareto frontier significantly toward the upper right.

Rubin + LPX breaks through 1000 tokens/s at peak speed while still maintaining enough overall throughput to generate real commercial value (this is critically important). Remember: on Blackwell, if you wanted to hit 400–500 tokens/s, you could only process a tiny handful of concurrent requests. That's a massive waste of GPU resources. But now, even at 1000 tokens/s, you can still maintain a meaningful batch size (throughput), and it can finally generate commercial value.

The chart shows that at 400 tokens/s, Rubin + LPX delivers a 35x improvement in throughput: classic token economics. At that token speed, it represents a 35x improvement in commercial value over Blackwell.

---------------------

After GTC revealed this standard answer (and even earlier, after the Groq LPU acquisition), everyone started copying this homework on the heterogeneous inference front.

Google's TPU tapped Marvell for the SRAM component.
Amazon AWS's Trainium tapped Cerebras for the SRAM component.
ByteDance's AI ASIC tapped Qualcomm for the SRAM component.

We will absolutely see more announcements like these in the future.

And this is the best path to economic sustainability for Cerebras: stop trying to muscle through full-stack AI inference, focus on what you're good at, partner with mainstream AI ASICs on inference, and fight to embed your SRAM chips into other companies' decode FFN pipelines.

This is also why Cerebras's long-term trajectory hinges on how deeply it can integrate with AWS Trainium's disaggregated inference.
If it's just what's been reported so far (Trainium handles prefill and Cerebras handles decode as a simple split), the technical implementation difficulty is much lower, but the economics still don't pencil out. It can only be a strategic positioning play: enough for a slice of the market, but not enough to generate a real competitive advantage at scale.

To follow the Nvidia playbook, they need deep integration of both companies' strengths. That will take real time and technical effort, with non-trivial difficulty, but the payoff would be worth it.

Solution one: Trainium handles prefill and decode attention, Cerebras handles decode FFN.

Solution two: Cerebras runs the draft model, Trainium runs verification.

Either solution delivers dramatically more market competitiveness.

--------------------

Does this kind of partnership with mainstream AI ASICs shrink the TAM for SRAM-route companies?

No. This is the only long-term sustainable path for SRAM-route companies to grow their market. Heterogeneous AI inference is unequivocally the future. Finding your piece of the puzzle early in this growing blueprint is the only way to grow alongside the market.

The moment an SRAM-route company embeds itself into any mainstream AI ASIC's heterogeneous inference pipeline, its valuation will skyrocket, because the TAM in terms of unit shipments isn't even in the same order of magnitude.

Otherwise, heterogeneous AI inference will relentlessly erode the SRAM route's advantage on the token speed dimension (not throughput). Full-stack inference on the SRAM route degenerating into an expensive toy is an inevitable outcome.
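The memory-size argument against running decode-stage attention out of SRAM comes down to simple arithmetic: KV cache grows linearly with both context length and batch size. A minimal sketch of that calculation follows. Only the 44GB SRAM figure comes from the post; the model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for illustration, and real models with different KV layouts or compression will give different numbers.

```python
# Back-of-the-envelope KV cache sizing.
# All model parameters are hypothetical; only the 44 GB SRAM figure is from the post.
SRAM_CAPACITY_BYTES = 44 * 10**9


def kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers (K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(context_tokens, batch_size, **model):
    """Total KV cache: linear in both context length and batch size."""
    return kv_bytes_per_token(**model) * context_tokens * batch_size


def max_batch_size(context_tokens, capacity_bytes=SRAM_CAPACITY_BYTES, **model):
    """How many concurrent sequences of this length fit in the given memory."""
    return capacity_bytes // kv_cache_bytes(context_tokens, 1, **model)


for ctx in (8_000, 32_000, 128_000):
    gb = kv_cache_bytes(ctx, 1) / 1e9
    print(f"{ctx:>7} tokens: {gb:6.1f} GB per sequence, "
          f"{max_batch_size(ctx)} sequences fit in 44 GB")
```

Under these assumed parameters a single 128K-token sequence already needs about 42 GB, so the chip's SRAM is consumed by one user before any weights are stored. That is the sense in which SRAM satisfies the bandwidth half of the attention requirement but not the capacity half.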

2026.05.15 05:11


fin@fi56622380

2026.03.22 04:44

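GTC 2026 recap: Nvidia has closed its gap and sharply weakened the biggest advantage of the AI accelerator startups, which is token speed.

Looking back at my GTC preview, the predicted direction and technical route held up without major problems, but the solution Nvidia ended up with is more elegant than I imagined: not only does prefill run on the GPU, decode-stage attention also runs on the GPU (I did not see that coming), and only the decode-stage MLP runs on the LPU.

This is similar in spirit to MatX's approach: weights in SRAM, KV cache in HBM.

The benefit is that attention needs a huge KV cache (easily tens to hundreds of GB), which LPU SRAM could never hold anyway, so keeping it in HBM is the more sensible choice.

It directly addresses the trend in future agentic flows toward long multi-turn contexts and exploding long-context KV cache; even the huge KV cache produced by high-batch concurrency fits in HBM. As context length grows, all of the incremental cost lands on GPU HBM. The LPX side is completely static and unaffected, and depends only on the size of the model itself.

The LPU's precious 128GB of SRAM holds only the fixed weights of the FFN/MLP stage. FFN accounts for more than 50% of the GPU's total decode time, and can exceed 60% with short contexts, so speeding that part up several-fold on the LPU is a very good deal.

One drawback of this design may be that a transformer's decode pass has many layers. Take 80 layers as an example: attention and FFN repeat 80 times, which means tensors have to travel back and forth between GPU and LPU 80 times. Even over low-latency Nvidia Spectrum-X Ethernet, generating one token accumulates 80 GPU-LPU round-trip latencies, which is a non-trivial overhead.

With this new architecture, assuming attention and FFN take 40% and 60% of decode time respectively and the FFN stage is sped up several-fold, overall peak speed can more than double (compared with Rubin NVL 72).

Peak speed breaks through 1000 tokens/s while overall throughput still retains real commercial value. Keep in mind that on Blackwell, hitting 400~500 tokens/s meant serving only a handful of requests at a time, which is a massive waste of GPU resources. Now, even at 1000 tokens/s, you can keep a meaningful batch size (throughput), and it can finally generate commercial value.

The chart says that at 400 tokens/s, Rubin + LPX improves throughput 35x. This is classic token economics: at that token speed, it is a 35x improvement in commercial value over Blackwell.

-------

Now that Nvidia has closed this gap, what does it mean for the startups (such as Cerebras, d-Matrix, MatX, SambaNova)?

The startups' biggest selling point is a speed advantage, or a cost advantage, in specific scenarios.

In large-batch (many-request) scenarios, once the GPU's arithmetic intensity crosses the ridge point its utilization becomes very high, and it holds a significant cost and speed advantage over the startups. So the main scenario in which these startups survive is one where the customer's workload is concentrated in small batches, low latency and extreme speed, with no concern for very high cost. GPUs are extremely inefficient here and cannot reach the corresponding token speeds.

Cerebras: extreme speed. A wafer-scale design with a huge SRAM (40GB) eliminates chip-to-chip communication, the biggest bottleneck, and delivers extremely high token rates in small-batch, few-user scenarios. But the cost is completely uncompetitive: a CS-3 system is priced at $2.3 million, far above an equivalent GPU cluster. Against H100, it trades more than ten times the cost for more than ten times the speed.

d-Matrix: high speed plus small-batch scenarios. In-memory compute reduces data movement and achieves higher utilization than a GPU in small-batch decode, so its perf/watt is reasonably competitive in this range. The recently introduced 3D stacked DRAM is meant to address the continued capacity/bandwidth scaling problem created by "larger reasoning models + higher token consumption".

SambaNova: in enterprise private deployments that run several small and mid-sized models at once, GPU utilization suffers badly from context switching, and SambaNova's RDU offers better perf/dollar in this scenario. In essence it is still a cost advantage in a specific scenario; its general speed advantage is not that large.

MatX: a partitionable systolic array plus an SRAM/HBM hybrid, similar in thinking to Nvidia's heterogeneous architecture this time. Its biggest highlight is putting weights in SRAM and KV cache in HBM within a single chip. Doing this on one chip avoids the 80 layers of LPU-GPU inter-chip communication that AFD (attention-FFN disaggregation) incurs, as mentioned above, so it still has some speed advantage, though its scalability may fall short of a GPU + LPU array.

In short, with Rubin + LPX, much of the former gap in the small-batch, low-latency, extreme-speed scenario has been filled, and the room in which each startup holds an advantage keeps shrinking.

-------------

The preview suggested speculative decoding with the LPU running the draft model and the GPU doing verification, which would give a very large speedup. That guess was exactly right: the official blog gives it heavy emphasis, devoting a whole section to this usage: "LPX generates draft tokens rapidly using its low-latency architecture. Rubin GPUs verify and finalize tokens efficiently"

The other item from the preview, CPX (Content Phase aXcelerator, a compute module designed specifically for the compute-bound nature of prefill), seems to have vanished entirely from this GTC, without a single mention. Does that mean CPX has been cancelled outright? I don't think so, necessarily.

Prefill and decode are currently disaggregated, meaning some GPUs are dedicated to prefill and others to decode. Having CPX replace GPUs for prefill is the more sensible choice architecturally and would speed up the prefill stage, although it would of course cost more, since it is an additional chip.

CPX does not conflict with Nvidia's current Rubin + LPX architecture. It would simply replace the GPUs that do prefill with CPX, so when there is demand for further speed optimization, CPX may well come back.

-------------------------------------------------------

The same reflection as in the previous post: every change in computing paradigm brings a new wave of semiconductor startups, but as software and application forms gradually converge, the outcome is still that the big companies use acquisitions to make their products broader and more complete, with higher specs, deeper and more comprehensive system integration, lower cost, and better power and benchmark numbers, so that startups slowly lose the room to survive independently.

In the early mobile internet era, for example, there was a similar crowd of contenders: small companies building application processors (AP), standalone baseband chips, ISPs and GPUs. But the eventual winners were all heterogeneous computing platforms that pulled the GPU, ISP and modem into the SoC and completed system-level integration. Apple acquiring PA Semi's CPU and Infineon's modem and hollowing out Imagination's GPU, and Qualcomm acquiring ATI's mobile GPU, Atheros's Wi-Fi, Nuvia's CPU and CSR's Bluetooth/DSP, are typical examples.

As heterogeneous inference grows more complex, companies capable of system-level integration will have the advantage, which is exactly the logic of the mobile SoC era. In the AI era, Nvidia's acquisition of Arm (failed), of Mellanox and of Groq is only the beginning of this new historical cycle.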
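The recap's speedup estimate (attention 40%, FFN 60%, FFN accelerated several-fold) is an instance of Amdahl's law, and the 80 GPU-LPU round trips add a fixed per-token cost on top of it. A rough sketch of both effects follows. The 40/60 split and the 80 layers come from the post; the FFN speedup factors, the baseline per-token time and the per-hop latency are illustrative assumptions, not figures from Nvidia or from the post.

```python
def overall_speedup(ffn_fraction, ffn_speedup):
    """Amdahl's law: only the FFN share of decode time is accelerated."""
    return 1.0 / ((1.0 - ffn_fraction) + ffn_fraction / ffn_speedup)


def tokens_per_second(base_token_ms, ffn_fraction, ffn_speedup,
                      num_layers, round_trip_us):
    """Per-stream token rate once FFN is offloaded.

    base_token_ms : assumed GPU-only time per token (hypothetical value)
    round_trip_us : assumed GPU<->LPU latency per layer (hypothetical value)
    """
    compute_ms = base_token_ms / overall_speedup(ffn_fraction, ffn_speedup)
    link_ms = num_layers * round_trip_us / 1000.0  # one round trip per layer
    return 1000.0 / (compute_ms + link_ms)


# FFN share of 60% as in the post; speedup factors are assumptions.
for s in (3, 5, 10):
    print(f"FFN x{s}: overall x{overall_speedup(0.6, s):.2f}")

# Effect of 80 per-layer round trips at a few assumed link latencies.
for rtt_us in (0, 2, 5):
    rate = tokens_per_second(2.0, 0.6, 5, num_layers=80, round_trip_us=rtt_us)
    print(f"round trip {rtt_us} us: {rate:.0f} tokens/s")
```

With FFN at 60% of decode time, the overall gain is bounded by 1 / 0.4 = 2.5x no matter how fast the FFN stage becomes, and a 5x FFN speedup gives roughly 1.9x. The second loop shows why the accumulated round-trip latency matters: it is added to every token and is not reduced by faster FFN compute.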
