Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally

Source: Simon Willison's Weblog · Field: Technology & Digital / Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

Dan Woods ran a custom version of the Qwen3.5-397B-A17B Mixture-of-Experts (MoE) model on a 48GB MacBook Pro M3 Max at 5.5+ tokens/second, even though the model takes up 209GB on disk (120GB quantized). He did this by implementing techniques from Apple's "LLM in a flash" paper, which supports inference for models larger than available DRAM by keeping parameters in flash storage and streaming them into DRAM on demand. Woods used Claude Code and an autoresearch pattern to write MLX Objective-C and Metal code, running 90 experiments along the way. The final model uses 4-bit quantized experts, while non-expert components such as the embedding table and routing matrices stay at their original precision; those components take up 5.5GB and remain resident in memory. The setup also reduces the number of experts used per token from 10 to 4. Quantizing the experts to 4-bit proved crucial for keeping tool calling working.
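To put the figures above side by side, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in the summary; the simplifying assumptions (all routed experts are the same size, and expert reads scale linearly with the number of experts per token) are mine, not the original article's.

```python
# Figures quoted in the summary above.
RAM_GB = 48.0                   # MacBook Pro M3 Max unified memory
MODEL_ON_DISK_GB = 209.0        # model size on disk
RESIDENT_GB = 5.5               # non-expert weights kept in memory
DEFAULT_EXPERTS_PER_TOKEN = 10  # usual routing width
REDUCED_EXPERTS_PER_TOKEN = 4   # routing width used in this setup

# How many times larger the model is than the machine's RAM.
disk_to_ram_ratio = MODEL_ON_DISK_GB / RAM_GB

# Share of RAM taken by the always-resident, non-expert weights.
resident_share_of_ram = RESIDENT_GB / RAM_GB

# Assumption: equal-sized experts, so reads scale with experts per token.
expert_read_reduction = 1 - REDUCED_EXPERTS_PER_TOKEN / DEFAULT_EXPERTS_PER_TOKEN

print(f"model is {disk_to_ram_ratio:.2f}x the size of RAM")
print(f"resident weights use {resident_share_of_ram:.1%} of RAM")
print(f"expert reads per token drop by {expert_read_reduction:.0%}")
```

Under those assumptions the model is a bit more than four times the size of RAM, the resident part uses a little over a tenth of it, and routing to 4 experts instead of 10 cuts per-token expert reads by 60%.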

Key takeaway

NLP engineers deploying large language models locally on memory-constrained hardware should explore streaming weights from flash storage. Consider quantizing MoE experts to 4-bit while keeping critical non-expert components, such as embedding tables, at higher precision. This balance can preserve essential capabilities such as tool calling, which 2-bit quantization may break.

Key insights

Efficient LLM inference on limited memory is possible by streaming expert weights from flash storage.
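The idea of streaming expert weights on demand can be illustrated with a toy sketch. This is not Woods's implementation (which is MLX Objective-C and Metal) and not the paper's algorithm; it only shows the core pattern of reading from disk just the experts the router selects. All names and sizes (`NUM_EXPERTS`, `DIM`, `load_expert`, `moe_forward`) are hypothetical, and each "expert" is reduced to a single float32 vector.

```python
import heapq
import os
import struct
import tempfile

NUM_EXPERTS = 10             # hypothetical toy sizes, not the real model's
DIM = 4
BYTES_PER_EXPERT = DIM * 4   # DIM float32 values per expert

def write_expert_file(path):
    """Write toy weights to disk: expert e is the vector [e, e, e, e]."""
    with open(path, "wb") as f:
        for e in range(NUM_EXPERTS):
            f.write(struct.pack(f"<{DIM}f", *([float(e)] * DIM)))

def load_expert(f, expert_id):
    """Read a single expert's bytes from storage (seek, then read)."""
    f.seek(expert_id * BYTES_PER_EXPERT)
    return struct.unpack(f"<{DIM}f", f.read(BYTES_PER_EXPERT))

def moe_forward(f, x, router_scores, top_k):
    """Pick the top_k experts by router score and load only those from disk."""
    chosen = heapq.nlargest(top_k, range(NUM_EXPERTS), key=lambda e: router_scores[e])
    total = sum(router_scores[e] for e in chosen)
    out = 0.0
    for e in chosen:
        weights = load_expert(f, e)        # the only disk read for this expert
        gate = router_scores[e] / total    # renormalise over the chosen experts
        out += gate * sum(w * xi for w, xi in zip(weights, x))
    return chosen, out

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "experts.bin")
        write_expert_file(path)
        with open(path, "rb") as f:
            scores = [0.1 * e for e in range(NUM_EXPERTS)]
            print(moe_forward(f, [1.0] * DIM, scores, top_k=4))
```

With `top_k=4`, only 4 of the 10 expert records are ever read; the other 6 stay on disk. A real system would add caching, batching of reads, and quantized weight formats, none of which are shown here.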

Method

An autoresearch pattern with an LLM (Claude Code) can generate and optimize MLX Objective-C and Metal code for efficient LLM inference, guided by an inference cost model.
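The general shape of such an experiment loop can be sketched as follows: propose a change, measure it, keep it only if the metric improves, and log every attempt. This is a minimal illustration under stated assumptions, not the actual workflow from the article. In particular, `propose` is a random stand-in for the coding agent, `toy_score` is a made-up benchmark rather than a real tokens/second measurement, and the config keys are hypothetical.

```python
import random

def toy_score(config):
    """Made-up benchmark: best at experts_per_token=4 and read_chunk_kb=256."""
    penalty = abs(config["experts_per_token"] - 4) + abs(config["read_chunk_kb"] - 256) / 64
    return 10.0 - penalty

def propose(config, rng):
    """Stand-in for the agent: change one setting by one step, at random."""
    candidate = dict(config)
    if rng.random() < 0.5:
        step = rng.choice([-1, 1])
        candidate["experts_per_token"] = max(1, candidate["experts_per_token"] + step)
    else:
        step = rng.choice([-64, 64])
        candidate["read_chunk_kb"] = max(64, candidate["read_chunk_kb"] + step)
    return candidate

def autoresearch(baseline, experiments, seed=0):
    """Run a fixed number of experiments, keeping only changes that improve the score."""
    rng = random.Random(seed)
    best, best_score = dict(baseline), toy_score(baseline)
    log = []
    for i in range(experiments):
        candidate = propose(best, rng)
        score = toy_score(candidate)
        kept = score > best_score
        if kept:
            best, best_score = candidate, score
        log.append((i, candidate, score, kept))  # record failures too
    return best, best_score, log

if __name__ == "__main__":
    start = {"experiments_note": "toy", "experts_per_token": 10, "read_chunk_kb": 64}
    best, score, log = autoresearch(start, experiments=90)
    print(best, score, sum(1 for entry in log if entry[3]), "changes kept")
```

Logging rejected experiments alongside accepted ones matters in practice, since the record of what did not help is part of what the loop produces.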

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.