Running local models on Macs gets faster with Ollama's MLX support

2026-03-31 · Source: AI - Ars Technica · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

Ollama, a runtime for running large language models locally, has released preview support for MLX, Apple's open-source machine learning framework, on Apple Silicon Macs (M1 or later). The update ships in Ollama 0.19, which also improves caching performance and adds support for Nvidia's NVFP4 model compression format for better memory efficiency. The changes are expected to speed up inference on compatible Macs by taking advantage of Apple's unified memory architecture, with the largest gains on M5-series chips, whose GPUs include Neural Accelerators. For now, the preview supports only one model, Alibaba's 35-billion-parameter Qwen3.5, and it requires at least 32GB of RAM. The release arrives as interest in local LLMs grows, driven by frustration with cloud service costs and rate limits, even though local models still trail frontier cloud models in benchmarks.

Key takeaway

For NLP engineers and developers experimenting with local LLMs on Apple Silicon Macs, this Ollama update deserves attention. An M1 or newer Mac, and especially an M5-series machine, can now run Qwen3.5-35B-A3B more efficiently, which may reduce reliance on costly cloud APIs. The model requires at least 32GB of RAM, and because memory on Apple Silicon Macs is fixed at purchase, a machine with less cannot be upgraded to meet that minimum. Monitor Ollama's releases for expanded model compatibility.
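A quick preflight check can confirm whether a given Mac meets the hardware minimum described above. The sketch below is illustrative only: the function names are hypothetical, the 32 GiB threshold is taken from the summary's 32GB figure, and it assumes a native (non-Rosetta) Python, which reports `arm64` on Apple Silicon.

```python
import platform
import subprocess

MIN_RAM_BYTES = 32 * 1024**3  # 32 GiB, matching the minimum cited in the summary


def meets_requirements(machine: str, mem_bytes: int) -> bool:
    """Return True for an Apple Silicon machine with at least 32 GiB of memory."""
    # A native Python on Apple Silicon reports "arm64"; under Rosetta it reports "x86_64".
    return machine == "arm64" and mem_bytes >= MIN_RAM_BYTES


def read_mac_memory_bytes() -> int:
    """Read installed memory in bytes on macOS via `sysctl -n hw.memsize`."""
    out = subprocess.run(
        ["sysctl", "-n", "hw.memsize"], capture_output=True, text=True, check=True
    )
    return int(out.stdout.strip())


if __name__ == "__main__":
    if platform.system() == "Darwin":
        ok = meets_requirements(platform.machine(), read_mac_memory_bytes())
        print("Meets the stated minimum" if ok else "Below the stated minimum")
    else:
        print("Not macOS; this check does not apply")
```

The decision logic is kept separate from the `sysctl` call so it can be exercised on any platform.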

Key insights

Ollama 0.19's preview MLX backend and NVFP4 support are expected to make local LLM inference on Apple Silicon Macs faster and more memory-efficient.

Principles

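Unified memory lets the CPU and GPU work from the same copy of the model weights instead of moving them between separate memory pools.
Model compression formats such as NVFP4 reduce the memory needed to hold a model's weights.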
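The effect of compression on memory can be estimated with simple arithmetic. The sketch below compares 16-bit weights with a 4-bit block-scaled format for a 35-billion-parameter model. The figure of 4.5 effective bits per weight is an assumption based on NVFP4's commonly described layout (4-bit values plus one 8-bit scale per block of 16 weights); it is not a number from the article, and the estimate ignores the KV cache, activations, and everything else running on the machine.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 35e9  # 35 billion parameters, as stated in the summary
FP16_BITS = 16
# Assumption: 4-bit values plus one 8-bit scale shared by each block of 16 weights.
NVFP4_BITS = 4 + 8 / 16

fp16_gb = weight_memory_gb(PARAMS, FP16_BITS)
nvfp4_gb = weight_memory_gb(PARAMS, NVFP4_BITS)

print(f"16-bit weights:  {fp16_gb:.1f} GB")
print(f"~4.5-bit weights: {nvfp4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 70 GB to about 19.7 GB, which is consistent with a 32GB minimum once runtime overhead is added on top.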

Method

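Ollama 0.19 adds a preview backend built on Apple's MLX framework, which takes advantage of unified memory on Apple Silicon Macs and the Neural Accelerators in M5-series GPUs. It also adds support for Nvidia's NVFP4 format to reduce model memory usage. Together these changes target faster local LLM inference.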

In practice

Run Qwen3.5-35B-A3B locally on Apple Silicon.
Use a Mac with at least 32GB of RAM, the stated minimum for this model.
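Once a model is available locally, it can be queried through Ollama's local HTTP API. The sketch below uses only the Python standard library and the `/api/generate` endpoint on Ollama's default port. The model tag is a placeholder assumption, since the summary does not give the exact tag for Qwen3.5-35B-A3B; substitute whatever `ollama list` reports. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "qwen3.5:35b-a3b"  # placeholder tag (assumption); check `ollama list`


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the Ollama server running, `generate("Explain unified memory in one sentence.")` returns the model's reply as a string. The long default timeout is a guess that allows for the model being loaded on the first call.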

Topics

Ollama
Apple MLX
Apple Silicon
Local LLMs
NVFP4

Code references

openclaw/openclaw

Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.