Streaming experts

2026-03-24 · Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Emerging Technologies & Innovation · Depth: Intermediate, quick

Summary

The "streaming experts" technique is gaining traction. It runs large Mixture-of-Experts (MoE) models on hardware with too little RAM to hold them by streaming expert weights from SSD. Dan Woods first demonstrated Qwen3.5-397B-A17B running in 48GB of RAM. More recently, @seikixtc reported running the 1-trillion-parameter Kimi K2.5 model, which has 32B active parameters, in 96GB of RAM on an M2 Max MacBook Pro. @anemll also showed Qwen3.5-397B-A17B running on an iPhone at 0.6 tokens/second. These results point to ongoing optimization work and to broader deployment of large language models on consumer-grade hardware.

Key takeaway

AI engineers and researchers exploring efficient deployment of large MoE models should consider the "streaming experts" technique. It reduces RAM requirements enough that Kimi K2.5 (1T parameters) has been reported running on an M2 Max MacBook Pro with 96GB of RAM, and Qwen3.5-397B-A17B on an iPhone at 0.6 tokens/second, with generation speed as the trade-off. Your team could make large models more widely accessible without specialized, high-memory GPUs.

Key insights

Streaming experts allows large MoE models to run on RAM-limited hardware by loading weights from SSD on demand.

Principles

Because an MoE model activates only a few experts per token, it can run with only part of its weights loaded.
SSD bandwidth can compensate for limited RAM, at the cost of generation speed.
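
The second principle can be made concrete with a back-of-envelope bound: if uncached expert weights must be read from SSD for every token, generation speed cannot exceed SSD read bandwidth divided by the bytes read per token. The sketch below is only that bound. The function name, the cache-hit-rate parameter, and the example numbers are illustrative assumptions, not measurements from the demos described here.

```python
def max_tokens_per_second(active_expert_bytes, cache_hit_rate, ssd_bytes_per_second):
    """Upper bound on decode speed when uncached expert weights come from SSD.

    active_expert_bytes:  bytes of expert weights needed per token (assumed)
    cache_hit_rate:       fraction of those bytes already resident in RAM (assumed)
    ssd_bytes_per_second: sustained SSD read bandwidth (assumed)
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    bytes_from_ssd = active_expert_bytes * (1.0 - cache_hit_rate)
    if bytes_from_ssd == 0:
        return float("inf")  # everything is cached, so the SSD is not the bottleneck
    return ssd_bytes_per_second / bytes_from_ssd


if __name__ == "__main__":
    # Hypothetical numbers: 8 GB of expert weights per token, half of them
    # cached in RAM, and an SSD that sustains 4 GB/s.
    print(max_tokens_per_second(8e9, 0.5, 4e9))  # prints 1.0
```

The bound ignores compute time and any overlap between reads and computation, so real throughput would be lower or higher depending on how well I/O is hidden.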

Method

The method involves dynamically loading only the necessary expert weights from SSD into RAM for each token processed, rather than fitting the entire model into memory.
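
As a minimal sketch of the idea, assuming each expert is stored as a fixed-size block of float32 values in one file, the following toy loader reads only the experts the router selects and keeps a small LRU cache in RAM. It uses only the Python standard library. Every name here (`StreamingExpertStore`, `top_k_experts`, `moe_forward`, `write_demo_experts`) is hypothetical, and the "expert" is a single weight vector rather than a real feed-forward block. It is not the implementation used in Anemll/flash-moe or any of the demos.

```python
import os
import tempfile
from array import array
from collections import OrderedDict


class StreamingExpertStore:
    """Reads fixed-size expert weight blocks from a file on demand, with an LRU cache."""

    def __init__(self, path, num_experts, weights_per_expert, max_cached):
        self.path = path
        self.num_experts = num_experts
        self.weights_per_expert = weights_per_expert
        self.max_cached = max_cached
        self.cache = OrderedDict()  # expert_id -> array of float32 weights
        self.disk_reads = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        weights = array("f")
        with open(self.path, "rb") as f:
            # Seek straight to this expert's block instead of reading the whole file.
            f.seek(expert_id * self.weights_per_expert * weights.itemsize)
            weights.fromfile(f, self.weights_per_expert)
        self.disk_reads += 1
        self.cache[expert_id] = weights
        if len(self.cache) > self.max_cached:
            self.cache.popitem(last=False)  # evict the least recently used expert
        return weights


def top_k_experts(router_scores, k):
    """Indices of the k highest router scores."""
    order = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return order[:k]


def moe_forward(store, x, router_scores, k):
    """Toy MoE layer: score-weighted sum of dot(expert_weights, x) over the top-k experts."""
    out = 0.0
    for e in top_k_experts(router_scores, k):
        w = store.get(e)  # only the selected experts are ever loaded
        out += router_scores[e] * sum(wi * xi for wi, xi in zip(w, x))
    return out


def write_demo_experts(path, num_experts, weights_per_expert):
    """Write demo weights: expert e is a vector filled with the value float(e)."""
    with open(path, "wb") as f:
        for e in range(num_experts):
            array("f", [float(e)] * weights_per_expert).tofile(f)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_demo_experts(path, num_experts=8, weights_per_expert=4)
        store = StreamingExpertStore(path, 8, 4, max_cached=2)
        scores = [0, 0, 0, 0.5, 0, 0, 0.25, 0]  # router picks experts 3 and 6
        print(moe_forward(store, [1.0] * 4, scores, k=2))  # prints 12.0
        print(store.disk_reads)  # prints 2: six of the eight experts were never read
    finally:
        os.remove(path)
```

The cache size is the knob that trades RAM for speed: a larger `max_cached` means fewer SSD reads when consecutive tokens reuse the same experts.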

In practice

Run a 1T-parameter model (Kimi K2.5) in 96GB of RAM on an M2 Max MacBook Pro.
Run a 397B-parameter model (Qwen3.5-397B-A17B) on an iPhone at 0.6 tokens/second.

Topics

Streaming Experts
Mixture-of-Experts
Large Language Models
On-device AI
Model Optimization

Code references

Anemll/flash-moe

Best for: NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer, Deep Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.