Streaming experts
Summary
The "streaming experts" technique runs large Mixture-of-Experts (MoE) models on hardware with too little RAM to hold them by streaming expert weights from SSD, and it is gaining traction. Dan Woods first demonstrated Qwen3.5-397B-A17B running in 48GB of RAM. More recently, @seikixtc reported running Kimi K2.5, a 1-trillion-parameter model with 32B active parameters, in 96GB of RAM on an M2 Max MacBook Pro, and @anemll showed Qwen3.5-397B-A17B running on an iPhone at 0.6 tokens/second. These results point to ongoing optimization work and to wider deployment of large language models on consumer-grade hardware.
Key takeaway
AI engineers and researchers who need to deploy large MoE models efficiently should consider the "streaming experts" technique. It sharply reduces RAM requirements: Kimi K2.5 (1T parameters) has been reported running on an M2 Max MacBook Pro with 96GB of RAM, and Qwen3.5-397B-A17B on an iPhone, though at reduced speed (0.6 tokens/second on the iPhone). This makes large models accessible without specialized, high-memory GPUs.
Key insights
Streaming experts allows large MoE models to run on RAM-limited hardware by loading weights from SSD on demand.
Principles
- MoE models can run with only part of their weights in RAM, because each token activates only a subset of the experts.
- SSD bandwidth can stand in for RAM capacity, at the cost of generation speed.
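The trade-off between SSD bandwidth and speed can be made concrete with a back-of-envelope estimate. The sketch below is an illustration only, not a measurement: the function names are hypothetical, and the quantization width, cache hit rate, and SSD read speed are assumed values that do not come from the original article. Only the 32B active-parameter figure for Kimi K2.5 is taken from the summary above. It also treats every active parameter as streamable, which overstates the traffic for any weights that stay resident in RAM.

```python
def streamed_bytes_per_token(active_params, bits_per_weight, cache_hit_rate):
    """Bytes that must be read from SSD for one token.

    cache_hit_rate is the fraction of the active weights already held in RAM.
    """
    active_bytes = active_params * bits_per_weight / 8
    return active_bytes * (1.0 - cache_hit_rate)


def max_tokens_per_second(ssd_bytes_per_second, bytes_per_token):
    """Upper bound on generation speed if SSD reads are the only bottleneck."""
    if bytes_per_token == 0:
        return float("inf")
    return ssd_bytes_per_second / bytes_per_token


# Assumed inputs: 4-bit weights, 75% of active weights cached, 4 GB/s SSD reads.
per_token = streamed_bytes_per_token(
    active_params=32e9,   # Kimi K2.5 active parameters, as reported above
    bits_per_weight=4,    # assumption
    cache_hit_rate=0.75,  # assumption
)
bound = max_tokens_per_second(ssd_bytes_per_second=4e9, bytes_per_token=per_token)
print(f"{per_token / 1e9:.1f} GB streamed per token, at most {bound:.2f} tokens/second")
```

Raising the cache hit rate or lowering the bits per weight shrinks the bytes streamed per token, which is why the achievable speed depends heavily on how much RAM is left for caching experts.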
Method
The method involves dynamically loading only the necessary expert weights from SSD into RAM for each token processed, rather than fitting the entire model into memory.
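The idea can be sketched as a small cache that reads an expert's weights from disk only when the router selects it. This is a toy illustration under stated assumptions, not the implementation used in any of the demos above: `ExpertCache`, `moe_layer`, `demo`, the `expert_<id>.bin` file layout, the least-recently-used eviction policy, and the tiny dimensions are all hypothetical.

```python
import array
import math
import random
import tempfile
from collections import OrderedDict
from pathlib import Path

DIM = 8  # hypothetical hidden size for the toy model


class ExpertCache:
    """Holds at most `capacity` experts in RAM and evicts the least recently used."""

    def __init__(self, weights_dir, capacity):
        self.weights_dir = Path(weights_dir)
        self.capacity = capacity
        self.cache = OrderedDict()
        self.disk_reads = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as recently used
            return self.cache[expert_id]
        # Cache miss: stream this expert's weights from disk.
        weights = array.array("f")
        with open(self.weights_dir / f"expert_{expert_id}.bin", "rb") as f:
            weights.fromfile(f, DIM * DIM)
        self.disk_reads += 1
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # drop the least recently used expert
        return weights


def matvec(weights, x):
    """Multiply a row-major DIM x DIM matrix by a vector."""
    return [sum(weights[r * DIM + c] * x[c] for c in range(DIM)) for r in range(DIM)]


def moe_layer(x, router, cache, top_k):
    """Route x to its top_k experts and mix their outputs with softmax gates."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in router]
    chosen = sorted(range(len(scores)), key=scores.__getitem__)[-top_k:]
    peak = max(scores[i] for i in chosen)
    gates = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(gates)
    out = [0.0] * DIM
    for gate, expert_id in zip(gates, chosen):
        y = matvec(cache.get(expert_id), x)  # only chosen experts are loaded
        out = [o + (gate / total) * v for o, v in zip(out, y)]
    return out


def demo(num_experts=8, top_k=2, capacity=3, tokens=20):
    rng = random.Random(0)
    with tempfile.TemporaryDirectory() as tmp:
        # Write each expert to its own file, standing in for weights on SSD.
        for i in range(num_experts):
            data = array.array("f", (rng.gauss(0, 1) for _ in range(DIM * DIM)))
            with open(Path(tmp) / f"expert_{i}.bin", "wb") as f:
                data.tofile(f)
        router = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(num_experts)]
        cache = ExpertCache(tmp, capacity)
        for _ in range(tokens):
            x = [rng.gauss(0, 1) for _ in range(DIM)]
            out = moe_layer(x, router, cache, top_k)
        return len(out), len(cache.cache), cache.disk_reads
```

Calling `demo()` returns the output length, the number of experts resident in RAM at the end, and the number of disk reads. The resident count never exceeds `capacity`, no matter how many experts exist on disk, which is the property that lets a model larger than RAM run at all.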
In practice
- Run a 1T-parameter model (Kimi K2.5) on an M2 Max MacBook Pro with 96GB of RAM.
- Run a 397B-parameter model (Qwen3.5-397B-A17B) on an iPhone at 0.6 tokens/second.
Topics
- Streaming Experts
- Mixture-of-Experts
- Large Language Models
- On-device AI
- Model Optimization
Best for: NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer, Deep Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.