Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally
Summary
Dan Woods ran a custom version of the Qwen3.5-397B-A17B Mixture-of-Experts (MoE) model on a 48GB MacBook Pro M3 Max at 5.5+ tokens/second, even though the model takes up 209GB on disk (120GB quantized). He did this by implementing techniques from Apple's "LLM in a Flash" paper, which makes LLM inference possible on limited memory by keeping parameters in flash storage and streaming them into DRAM on demand. Woods used Claude Code and an autoresearch pattern to generate MLX Objective-C and Metal code over 90 experiments. The final model quantizes the experts to 4-bit and keeps non-expert components, such as embedding tables and routing matrices, at their original precision; those non-expert components add up to 5.5GB and stay resident in memory. The setup also cuts the number of experts used per token from 10 to 4. 4-bit quantization proved crucial for keeping tool calling working.
Key takeaway
NLP engineers deploying large language models locally on resource-constrained hardware should explore flash memory streaming techniques. Consider quantizing MoE experts to 4-bit while keeping critical non-expert components, such as embedding tables, at higher precision. This balance can preserve essential functionality such as tool calling, which 2-bit quantization may break.
Key insights
Efficient LLM inference on limited memory is possible by streaming expert weights from flash storage.
Principles
- Stream expert weights from SSD
- Quantize experts aggressively
- Keep non-experts at higher precision
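The first principle can be illustrated with a toy sketch. This is not Woods's MLX/Metal implementation: it assumes NumPy, uses `np.memmap` as a stand-in for streaming from SSD, and every name and size here (`moe_forward`, `NUM_EXPERTS`, `TOP_K`, the file layout) is hypothetical. The router stays in memory, and only the experts chosen for a token are read from disk.

```python
import os
import tempfile
import numpy as np

# Toy sizes chosen for illustration only; they are not Qwen's dimensions.
NUM_EXPERTS, D_MODEL, D_FF, TOP_K = 16, 8, 32, 4

def write_toy_experts(path, rng):
    """Write random expert matrices to disk, standing in for a weight file."""
    weights = rng.standard_normal((NUM_EXPERTS, D_MODEL, D_FF)).astype(np.float32)
    weights.tofile(path)
    return weights

def moe_forward(x, router, experts, top_k=TOP_K):
    """Route one token, then read only the selected experts."""
    logits = x @ router                       # one score per expert
    chosen = np.argsort(logits)[-top_k:]      # indices of the top-k experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                      # softmax over the chosen experts
    out = np.zeros(D_FF, dtype=np.float32)
    for gate, idx in zip(gates, chosen):
        w = np.asarray(experts[idx])          # pages in just this expert's weights
        out += gate * (x @ w)
    return out, chosen

rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
reference = write_toy_experts(path, rng)

# Memory-map the expert file: nothing is read until an expert is indexed.
experts_on_disk = np.memmap(path, dtype=np.float32, mode="r",
                            shape=(NUM_EXPERTS, D_MODEL, D_FF))
router = rng.standard_normal((D_MODEL, NUM_EXPERTS)).astype(np.float32)  # resident
x = rng.standard_normal(D_MODEL).astype(np.float32)

out, chosen = moe_forward(x, router, experts_on_disk)
print("experts read for this token:", sorted(chosen.tolist()))
```

With `TOP_K` experts out of `NUM_EXPERTS`, each token touches only a fraction of the expert file, which is why reducing the experts used per token also reduces the data read from storage.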
Method
An autoresearch pattern, in which an LLM agent (Claude Code) runs repeated experiments guided by an inference cost model, can generate and optimize MLX Objective-C and Metal code for efficient LLM inference.
In practice
- Target 4-bit quantization for MoE experts
- Prioritize tool-calling functionality
- Retain original precision for non-expert parts
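To give a feel for why bit width matters, here is a minimal sketch comparing 2-bit and 4-bit reconstruction error on random weights. It assumes NumPy and a simple group-wise affine scheme with a hypothetical group size of 64; it is not MLX's quantizer or the scheme Woods used, and it measures numeric error only, not tool-calling behavior.

```python
import numpy as np

def quantize_groupwise(w, bits, group_size=64):
    """Affine-quantize a 1-D float array in fixed-size groups.

    Returns integer codes plus the per-group scale and minimum needed
    to reconstruct approximate weights.
    """
    levels = 2 ** bits - 1
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # avoid divide-by-zero
    codes = np.round((groups - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_groupwise(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return (codes * scale + lo).reshape(-1)

rng = np.random.default_rng(0)
expert_w = rng.standard_normal(4096).astype(np.float32)  # stand-in expert weights

errors = {}
for bits in (2, 4):
    codes, scale, lo = quantize_groupwise(expert_w, bits)
    restored = dequantize_groupwise(codes, scale, lo)
    errors[bits] = float(np.sqrt(np.mean((expert_w - restored) ** 2)))

for bits, rmse in errors.items():
    print(f"{bits}-bit RMSE: {rmse:.4f}")
```

Non-expert parts would simply skip this step and keep their original dtype, which is the split the bullets above describe.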
Topics
- LLM in a Flash
- Mixture-of-Experts
- Model Quantization
- Local LLM Inference
- Apple M3 Max
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.