Distributing LLM inference in DwarfStar
Summary
Local LLM inference is limited by hardware: high-end NVIDIA cards are costly, and Apple hardware is viable but constrained. A Mac Studio M3 Ultra with 512GB, at roughly 12k total, reaches about 150 t/s prefill and 10-13 t/s decoding with DeepSeek v4 PRO. An M5 Max with 128GB, at 6-7k, reaches ~500 t/s prefill and 35-40 t/s decoding with models like DeepSeek v4 Flash. To get past these limits, the article explores distributed inference. It first covers the traditional methods: sequential layer distribution, and vertical splitting over Apple RDMA. It then introduces a novel third alternative, LLM ensembles, in which multiple open-weight models run independently on different machines and their logits are combined, or the best continuation is selected, to improve performance and broaden the knowledge available.
Key takeaway
AI engineers evaluating cost-effective LLM deployment should consider distributed inference beyond traditional layer or expert splitting. LLM ensembles are a promising way to improve model performance and scale without expensive NVIDIA setups: the models run independently and their outputs are combined. To try the approach, experiment with open-weight models such as DeepSeek v4 Flash or Mimo V2.5 on multiple Apple Silicon devices.
Key insights
Distributed LLM inference, particularly novel ensemble methods, can overcome local hardware limitations for large models.
Principles
- Sequential layer distribution sends only activations between machines.
- LLM ensembles combine logits or select the best continuation.
- Ensembles improve model knowledge by combining diverse perspectives.
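The first principle can be illustrated with a toy sketch. It assumes each machine owns a contiguous slice of layers and that the only data crossing the machine boundary is the activation vector. The names `make_layer`, `Stage`, and `run_pipeline` are hypothetical, and the "layers" are simple arithmetic stand-ins rather than real transformer blocks.

```python
def make_layer(scale, bias):
    """Build a toy layer: an elementwise affine map standing in for a real block."""
    def layer(activations):
        return [scale * x + bias for x in activations]
    return layer


class Stage:
    """One machine (hypothetical): holds the weights for a contiguous slice of layers."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, activations):
        # The weights never leave this stage; only activations flow through it.
        for layer in self.layers:
            activations = layer(activations)
        return activations


def run_pipeline(stages, activations):
    """Run the stages in order. In a real deployment, each hand-off between
    stages would be a network transfer of the activation vector."""
    for stage in stages:
        activations = stage.forward(activations)
    return activations


layers = [make_layer(2, 0), make_layer(1, 1), make_layer(3, 0), make_layer(1, -1)]
single_machine = Stage(layers).forward([1.0, 2.0])
two_machines = run_pipeline([Stage(layers[:2]), Stage(layers[2:])], [1.0, 2.0])
```

Splitting the layers across two stages gives the same result as running them on one, which is why this scheme only needs to ship activations and not weights.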
Method
LLM ensembles run multiple open-weight models independently on separate machines, then combine their logits or select the best continuation based on perplexity.
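As a rough illustration of the logit-combining variant, the sketch below averages the softmax probabilities of each model and picks the most likely token. This is one possible combination rule, not necessarily the one used in the original article, and it assumes that all models share the same vocabulary and token indices, which is often not true across different model families. The function names `softmax`, `combine_logits`, and `pick_token` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(per_model_logits, weights=None):
    """Mix the next-token distributions of several models.

    Assumption: every model uses the same vocabulary, so index i means the
    same token in each logit vector.
    """
    n = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n] * n  # plain average by default
    vocab = len(per_model_logits[0])
    if any(len(logits) != vocab for logits in per_model_logits):
        raise ValueError("all models must produce logits over the same vocabulary")
    probs = [softmax(logits) for logits in per_model_logits]
    return [sum(w * p[i] for w, p in zip(weights, probs)) for i in range(vocab)]


def pick_token(per_model_logits, weights=None):
    """Greedy choice: the token index with the highest mixed probability."""
    mixed = combine_logits(per_model_logits, weights)
    return max(range(len(mixed)), key=mixed.__getitem__)


# Two toy models over a 3-token vocabulary (values are made up for illustration).
model_a = [2.0, 1.0, 0.0]
model_b = [0.0, 1.0, 5.0]
mixed = combine_logits([model_a, model_b])
token = pick_token([model_a, model_b])
```

Averaging probabilities instead of raw logits avoids letting a model with a larger logit scale dominate the mix; the optional `weights` argument allows trusting one model more than another.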
In practice
- Run the full-size DeepSeek v4 PRO across two 512GB Mac Studios.
- Combine logits from different models for improved output.
- Select the ensemble output with the lowest perplexity.
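Selection by perplexity can be sketched as follows. Perplexity is the exponential of the negative mean token log-probability, so a lower value means the tokens were on average more likely. The sketch assumes each candidate continuation arrives with its per-token natural-log probabilities; which model scores which candidate is left open here, and the names `perplexity` and `select_continuation` are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean log-probability) over the tokens of one continuation."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_continuation(candidates):
    """Return the text of the candidate with the lowest perplexity.

    candidates: list of (text, token_logprobs) pairs, one per model.
    """
    best_text, _ = min(candidates, key=lambda c: perplexity(c[1]))
    return best_text


# Made-up log-probabilities, for illustration only.
candidates = [
    ("continuation from model A", [math.log(0.5), math.log(0.5)]),
    ("continuation from model B", [math.log(0.9), math.log(0.8)]),
]
chosen = select_continuation(candidates)
```

One caveat under these assumptions: perplexities computed by different models over different tokenizations are not strictly comparable, so a real system would likely need some normalization or a shared scoring model.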
Topics
- LLM Inference
- Distributed Computing
- Apple Silicon
- Model Ensembles
- DeepSeek v4 PRO
- Quantization
- M5 Max
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by List of posts - <antirez>.