Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally

2026-03-18 · Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

Dan Woods ran a custom version of the Qwen3.5-397B-A17B Mixture-of-Experts (MoE) model on a 48GB MacBook Pro M3 Max at 5.5+ tokens/second, even though the model takes 209GB on disk (120GB quantized). He did this by implementing techniques from Apple's "LLM in a flash" paper, which speeds up LLM inference on memory-limited machines by keeping parameters in flash storage and streaming them into DRAM on demand. Woods used Claude Code and an autoresearch pattern to generate MLX Objective-C and Metal code, running 90 experiments. The final model uses 4-bit quantized experts, while non-expert components such as the embedding tables and routing matrices stay at their original precision and occupy 5.5GB of resident memory. The setup also reduced the number of experts used per token from 10 to 4. The 4-bit quantization proved crucial for keeping tool calling functional.

Key takeaway

NLP engineers optimizing large language models for local deployment on resource-constrained hardware should explore flash-memory streaming techniques. Consider quantizing MoE experts to 4-bit while keeping higher precision for critical non-expert components such as embedding tables. This balance can preserve essential functionality such as tool calling, which 2-bit quantization may break.

Key insights

Efficient LLM inference on limited memory is possible by streaming expert weights from flash storage.

Principles

Stream expert weights from SSD
Quantize experts aggressively
Keep non-experts at higher precision
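
The first principle can be illustrated with a small sketch. This is not the flash-moe implementation (which, per the summary, is MLX Objective-C and Metal code); it is a toy Python version that assumes NumPy, made-up layer sizes, and a single flat file standing in for the SSD copy of the expert weights. The names write_toy_experts and moe_forward are hypothetical. The router stays in RAM, and only the experts it selects are read from disk.

```python
import os
import tempfile
import numpy as np

# Toy sizes chosen for illustration only; they are not the real model's dimensions.
NUM_EXPERTS, D_IN, D_OUT = 8, 16, 16
TOP_K = 2  # experts activated per token (toy value)

def write_toy_experts(path, rng):
    """Write all expert matrices to one flat file, standing in for the SSD copy."""
    weights = rng.standard_normal((NUM_EXPERTS, D_IN, D_OUT)).astype(np.float32)
    weights.tofile(path)
    return weights

def moe_forward(x, router, expert_file):
    """Route x, then read only the selected experts' weights from disk."""
    logits = x @ router                      # the router matrix stays resident in RAM
    top = np.argsort(logits)[-TOP_K:]        # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                     # softmax over the selected experts
    # memmap maps the file without loading it; indexing touches only the needed pages.
    experts = np.memmap(expert_file, dtype=np.float32, mode="r",
                        shape=(NUM_EXPERTS, D_IN, D_OUT))
    out = np.zeros(D_OUT, dtype=np.float32)
    for g, e in zip(gates, top):
        w = np.array(experts[e])             # copy this one expert into RAM
        out += g * (x @ w)
    return out, top
```

Because the operating system pages in only the slices that are indexed, resident memory scales with the number of experts selected per token rather than with the total expert count, which is the property the streaming approach relies on.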

Method

An autoresearch pattern with an LLM (Claude Code) can generate and optimize MLX Objective-C and Metal code for efficient LLM inference, guided by an inference cost model.
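
The summary does not describe how the experiment loop was organized, so the following is only one plausible shape for it: a greedy loop that benchmarks each proposed change and keeps it only when it beats the best result so far. The function autoresearch and its propose and benchmark callbacks are hypothetical names. In a setup like the one described, propose would be played by the coding agent and benchmark by a tokens-per-second measurement. The default budget of 90 mirrors the number of experiments mentioned above.

```python
def autoresearch(baseline, propose, benchmark, budget=90):
    """Greedy experiment loop: keep a candidate only if it beats the best so far.

    baseline  -- starting configuration (any object)
    propose   -- callable(best, log) -> new candidate; hypothetical stand-in for the LLM
    benchmark -- callable(candidate) -> score, where higher is better
    budget    -- number of experiments to run
    """
    best, best_score = baseline, benchmark(baseline)
    log = [(baseline, best_score, True)]       # (candidate, score, kept) per experiment
    for _ in range(budget):
        candidate = propose(best, log)         # the history lets the proposer avoid repeats
        score = benchmark(candidate)
        kept = score > best_score
        log.append((candidate, score, kept))
        if kept:
            best, best_score = candidate, score
    return best, best_score, log
```

Passing the full log back to propose is a design choice in this sketch: it gives the proposer a record of failed experiments as well as successes, which matters when most attempts are regressions.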

In practice

Target 4-bit quantization for MoE experts
Prioritize tool-calling functionality
Retain original precision for non-expert parts
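
To make the 4-bit versus 2-bit trade-off concrete, here is a minimal sketch of group-wise quantization in NumPy. It is a generic asymmetric scheme written for illustration, not the format used by MLX or flash-moe. The group size of 32 and the function names are assumptions, and the input length must be a multiple of the group size.

```python
import numpy as np

def quantize(w, bits=4, group=32):
    """Asymmetric per-group quantization of a flat float array to `bits`-bit codes."""
    levels = (1 << bits) - 1                   # 15 steps for 4-bit, 3 for 2-bit
    g = w.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / levels
    scale[scale == 0] = 1.0                    # avoid dividing by zero on constant groups
    codes = np.rint((g - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return (codes * scale + lo).reshape(-1)

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into single bytes, low nibble first."""
    flat = codes.reshape(-1)
    return (flat[0::2] | (flat[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out
```

Running quantize on the same weights with bits=4 and bits=2 and comparing the dequantized results against the originals shows the larger reconstruction error at 2 bits. That only illustrates the precision loss; whether a given loss breaks a behavior such as tool calling has to be tested on the model itself.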

Topics

LLM in a Flash
Mixture-of-Experts
Model Quantization
Local LLM Inference
Apple M3 Max

Code references

danveloper/flash-moe

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.