Running local models on Macs gets faster with Ollama's MLX support
Summary
Ollama, a runtime for running large language models locally, has released preview support for Apple's open-source MLX machine learning framework on Apple Silicon Macs (M1 or later). The update, available in Ollama 0.19, also improves caching performance and adds support for Nvidia's NVFP4 model compression format, which reduces memory use. Together these changes are expected to boost performance on compatible Macs, particularly those with M5-series GPUs, by taking advantage of Apple's unified memory architecture and Neural Accelerators. For now, the only supported model is Alibaba's 35-billion-parameter Qwen3.5, which requires at least 32GB of RAM. The release comes as interest in local LLMs grows, driven by frustration with cloud service costs and rate limits, even though local models still trail frontier cloud models in benchmarks.
Key takeaway
For NLP engineers and developers experimenting with local LLMs on Apple Silicon Macs, this Ollama update is worth attention. An M1-series or newer Mac, and especially an M5-series one, can now run Qwen3.5-35B-A3B more efficiently, potentially reducing reliance on costly cloud APIs. The model requires at least 32GB of RAM, so check your machine's memory before relying on the MLX and NVFP4 gains, and monitor Ollama for expanded model compatibility.
Key insights
Ollama's preview MLX integration and NVFP4 support are expected to speed up local LLM inference and reduce memory use on Apple Silicon Macs.
Principles
- Unified memory optimizes local LLM performance.
- Model compression improves memory efficiency.
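To make the memory-efficiency principle concrete, here is a rough back-of-the-envelope sketch of how bits per weight drive the size of a 35-billion-parameter model's weights. The 4.5 bits-per-weight figure is an assumption for a 4-bit format that stores one 8-bit scale per block of 16 weights; it is not taken from the article, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 35e9  # 35 billion parameters, as stated in the summary

# Assumption: a 4-bit format with one 8-bit scale per 16-weight block
# averages 4 + 8/16 = 4.5 bits per weight.
FORMATS = [("16-bit", 16.0), ("8-bit", 8.0), ("4-bit + block scales", 4.5)]

for label, bits in FORMATS:
    print(f"{label:>22}: {weight_footprint_gb(PARAMS, bits):6.1f} GB")
```

Under these assumptions the weights shrink from about 70 GB at 16 bits to about 20 GB at 4.5 bits, which is consistent with the stated 32GB minimum once working memory is included.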
Method
Ollama 0.19 adds preview integration with Apple's MLX framework and support for Nvidia's NVFP4 format. The combination reduces memory use and, on M5-series GPUs, draws on the Neural Accelerators for faster local LLM inference on Apple Silicon Macs.
In practice
- Run Qwen3.5-35B-A3B locally on Apple Silicon.
- Use a Mac with at least 32GB of RAM, the stated minimum for this model.
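Once a model is running locally, it can be queried through Ollama's local HTTP API. The following is a minimal sketch, assuming the Ollama server is listening on its default port 11434. The model tag `qwen3.5:35b-a3b` is a hypothetical placeholder, not confirmed by the article; substitute whatever name `ollama list` reports on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL_TAG = "qwen3.5:35b-a3b"  # hypothetical tag; check `ollama list` for the real one


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running and the model pulled, `print(generate("Summarize unified memory in one sentence."))` would print the model's reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.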
Topics
- Ollama
- Apple MLX
- Apple Silicon
- Local LLMs
- NVFP4
Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.