Distributing LLM inference in DwarfStar

2026-05-25 · Source: List of posts - <antirez> · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, medium

Summary

Local LLM inference is limited by hardware. High-end NVIDIA cards are costly, and Apple hardware is a viable but constrained alternative: a Mac Studio M3 Ultra 512GB reaches about 150 t/s prefill and 10-13 t/s decoding for DeepSeek v4 PRO, at a total cost of roughly 12k. The M5 Max 128GB, at 6-7k, reaches about 500 t/s prefill and 35-40 t/s decoding on models like DeepSeek v4 Flash. To get past these limits, the article explores distributed inference. It first covers the traditional methods: sequential layer distribution, and vertical splitting using Apple RDMA. It then introduces a novel approach, LLM ensembles, in which multiple open-weight models run independently on different machines and their logits are combined, or the best continuation is selected, to improve performance and knowledge. The article presents this as a promising third alternative.

Key takeaway

AI engineers evaluating cost-effective LLM deployment strategies should consider distributed inference beyond traditional layer or expert splitting. LLM ensembles, in which models run independently and their outputs are combined, offer a promising way to improve model performance and scale without expensive NVIDIA setups. To try the approach, experiment with open-weight models such as DeepSeek v4 Flash or Mimo V2.5 on multiple Apple Silicon devices.

Key insights

Distributed LLM inference, particularly novel ensemble methods, can overcome local hardware limitations for large models.

Principles

Sequential layer distribution sends only activations between machines.
LLM ensembles combine logits or select the best continuation.
Ensembles improve model knowledge by combining diverse perspectives.
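
The first principle can be illustrated with a toy sketch. It is not the article's implementation: the layers here are simple affine functions standing in for transformer layers, and `Stage`, `make_layer`, and `run_pipeline` are hypothetical names introduced only for illustration. The point it shows is that each machine keeps its own slice of the layers, and only the activation vector crosses the boundary between machines.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Toy stand-in for a transformer layer: an elementwise affine map."""
    def layer(x: Vector) -> Vector:
        return [scale * v + shift for v in x]
    return layer


class Stage:
    """One machine holding a contiguous slice of the model's layers."""

    def __init__(self, layers: List[Layer]) -> None:
        self.layers = layers

    def forward(self, activations: Vector) -> Vector:
        for layer in self.layers:
            activations = layer(activations)
        return activations


def run_pipeline(stages: List[Stage], x: Vector) -> Vector:
    # Only the activations `x` move from stage to stage; the layers
    # (the weights, in a real model) never leave their machine.
    for stage in stages:
        x = stage.forward(x)
    return x


# Four toy layers, run on one machine and then split across two.
layers = [make_layer(1.0 + 0.5 * i, 0.25 * i) for i in range(4)]
single_machine = Stage(layers)
two_machines = [Stage(layers[:2]), Stage(layers[2:])]

x0 = [1.0, -2.0, 0.5]
split_output = run_pipeline(two_machines, x0)
assert split_output == single_machine.forward(x0)
```

In a real deployment the hand-off inside `run_pipeline` would be a network transfer, and its size would depend on the activation width rather than on the size of the model.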

Method

LLM ensembles run multiple open-weight models independently on separate machines, then combine their logits or select the best continuation based on perplexity.
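
As a minimal sketch of the logit-combining variant, the code below takes a weighted average of the next-token logits from each model and picks the most likely token. The source does not specify how the logits are combined, so the averaging rule is an assumption, as is the requirement that all models share one vocabulary so that index i means the same token everywhere. `combine_logits` and the example values are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(
    per_model_logits: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of next-token logits, one vector per model.

    Assumption: every model uses the same vocabulary, so position i
    refers to the same token in every vector.
    """
    n_models = len(per_model_logits)
    if n_models == 0:
        raise ValueError("need at least one model")
    vocab_size = len(per_model_logits[0])
    if any(len(v) != vocab_size for v in per_model_logits):
        raise ValueError("all models must share one vocabulary size")
    if weights is None:
        weights = [1.0 / n_models] * n_models
    return [
        sum(w * logits[i] for w, logits in zip(weights, per_model_logits))
        for i in range(vocab_size)
    ]


# Made-up logits over a three-token vocabulary, one vector per machine.
model_a = [2.0, 1.0, 0.0]
model_b = [0.0, 3.0, 0.0]

combined = combine_logits([model_a, model_b])
probs = softmax(combined)
next_token = max(range(len(probs)), key=probs.__getitem__)
```

Here `model_a` alone would pick token 0, while the ensemble picks token 1. Raw logits from different models may sit on different scales, so averaging probabilities or log-probabilities instead is an equally plausible choice.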

In practice

Run full-size DeepSeek v4 PRO on two Mac Studio 512GB machines.
Combine logits from different models for improved output.
Select the ensemble output with the lower perplexity.
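
The last item, selecting by perplexity, can be sketched as follows. Perplexity is the exponential of the negative mean log-probability of the generated tokens. The source does not say which model scores each continuation, so this sketch assumes each candidate carries the log-probabilities assigned by the model that produced it. `Candidate`, the model names, and the numbers are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    model: str                   # hypothetical label for the producing model
    text: str                    # the generated continuation
    token_logprobs: List[float]  # natural-log probability of each token


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean log-probability; lower means more confident."""
    if not token_logprobs:
        raise ValueError("cannot score an empty continuation")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_lowest_perplexity(candidates: Sequence[Candidate]) -> Candidate:
    """Return the continuation whose tokens were, on average, most likely."""
    return min(candidates, key=lambda c: perplexity(c.token_logprobs))


candidates = [
    Candidate("model-a", "first continuation", [-0.1, -0.2, -0.3]),
    Candidate("model-b", "second continuation", [-1.0, -0.5]),
]
best = pick_lowest_perplexity(candidates)
```

Using the mean rather than the sum of log-probabilities keeps the comparison fair between continuations of different lengths.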

Topics

LLM Inference
Distributed Computing
Apple Silicon
Model Ensembles
DeepSeek v4 PRO
Quantization
M5 Max

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by List of posts - <antirez>.