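Eric Xu (e/Mettā)(@xleaps): Wow! Exo can be patched to have a CUDA backend, and now Apple Silicon and GB10 can be used together (pipeline parallel) for inference. Not sure about the network bandwidth, but the sheer human ingenuity is amazing. Honestly, Anthropic and OpenAI only have one or two years left of the 50% inference margin (and that's all because of the memory shortage for civilians). Token margins will collapse as TSMC pumps out all these chips that can be connected together. --- After a CUDA backend was added to exo, MLX and CUDA devices can run pipeline parallel (the model is split so that the first half goes to one node and the second half to another). The biggest bottleneck for personal local inference is no longer compute but memory. Multi-device networked inference across Apple devices already works; if heterogeneous networked inference can also be made to work with enough bandwidth, it will greatly lower the cost of local inference and give rise to new use cases: running open-source models around the clock, even though the average output rate is lower than a data-center API.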

2026.03.26 08:43

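The pipeline-parallel split mentioned in the post (first half of the model on one node, second half on the other) can be illustrated with a toy sketch. This is not exo's code or API: `split_layers`, `run_stage`, and `run_pipeline` are hypothetical names, the "layers" are plain Python functions, and the hand-off between stages stands in for what would be a network transfer between an MLX device and a CUDA device.

```python
from typing import Callable, List, Sequence

# A "layer" here is just a function on a number; a real model would use tensors.
Layer = Callable[[float], float]


def split_layers(layers: Sequence[Layer], num_stages: int) -> List[List[Layer]]:
    """Split layers into contiguous, near-equal stages (one stage per node)."""
    base, extra = divmod(len(layers), num_stages)
    stages: List[List[Layer]] = []
    start = 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)
        stages.append(list(layers[start:start + size]))
        start += size
    return stages


def run_stage(stage: Sequence[Layer], activation: float) -> float:
    """Run every layer owned by one node, in order."""
    for layer in stage:
        activation = layer(activation)
    return activation


def run_pipeline(stages: Sequence[Sequence[Layer]], x: float) -> float:
    """Pass the activation from stage to stage."""
    activation = x
    for stage in stages:
        # Assumption: in a real cluster this hand-off is a network send/receive,
        # which is where the bandwidth question raised in the post comes in.
        activation = run_stage(stage, activation)
    return activation


# Four toy layers, split across two hypothetical nodes.
layers: List[Layer] = [lambda v, k=k: v * 2 + k for k in range(4)]
stages = split_layers(layers, 2)
result = run_pipeline(stages, 1.0)
```

Only the activation at the stage boundary crosses between nodes, so each node needs to hold just its own share of the weights. That is why the split helps when memory, rather than compute, is the constraint.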