Ollama is now powered by MLX on Apple Silicon in preview

2026-03-29 · Source: Ollama Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

Ollama has released version 0.19 in preview, which speeds up inference on Apple Silicon by building on Apple's MLX machine learning framework. The update takes advantage of the unified memory architecture and, on M5, M5 Pro, and M5 Max chips, uses the new GPU Neural Accelerators to improve both time to first token (TTFT) and generation speed. In benchmarks run on March 29, 2026, with Alibaba's Qwen3.5-35B-A3B model, Ollama 0.19 reached 1810 tokens/s prefill and 134 tokens/s decode with NVFP4 quantization, against 1154 tokens/s prefill for Ollama 0.18. The release also adds support for NVIDIA's NVFP4 format, which maintains model accuracy while reducing memory footprint, and an upgraded caching system that improves responsiveness in agentic and coding tasks.
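To put the prefill figures in perspective, the short sketch below turns them into a speedup ratio and an idealized prompt-processing time. The 10,000-token prompt length is a hypothetical value chosen for illustration, and the constant-rate assumption ignores real-world effects such as caching or warm-up.

```python
# Prefill throughput figures quoted in the summary (tokens per second).
PREFILL_018 = 1154
PREFILL_019 = 1810


def prefill_seconds(prompt_tokens, tokens_per_second):
    """Idealized time to process a prompt at a constant prefill rate."""
    return prompt_tokens / tokens_per_second


speedup = PREFILL_019 / PREFILL_018

prompt_tokens = 10_000  # hypothetical prompt length, for illustration only
before = prefill_seconds(prompt_tokens, PREFILL_018)
after = prefill_seconds(prompt_tokens, PREFILL_019)

print(f"prefill speedup: {speedup:.2f}x")
print(f"{prompt_tokens} tokens: {before:.1f}s -> {after:.1f}s")
```

On these numbers the prefill rate improves by a bit more than half, which matters most for long prompts such as the context a coding agent sends on every turn.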

Key takeaway

For NLP engineers and developers running local LLMs on Apple Silicon, upgrading to Ollama 0.19 brings substantial performance gains, especially for coding agents and personal assistants. With MLX and NVFP4, inference is faster and more memory-efficient, which makes it easier to run large models like Qwen3.5-35B-A3B locally. Make sure your Mac has more than 32GB of unified memory to benefit fully from these enhancements.
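As a rough check on the memory guidance, the weight footprint under NVFP4 can be estimated with back-of-the-envelope arithmetic. The sketch assumes about 35 billion total parameters (inferred from the model name) and that every weight is stored in NVFP4 at an average of 4.5 bits (4-bit values plus one shared 8-bit scale per 16-value block). Real checkpoints usually keep some tensors in higher precision, so this is an estimate, not a measured figure.

```python
GIB = 2**30


def nvfp4_weight_gib(num_params, block_size=16, scale_bits=8):
    """Approximate weight storage in GiB for an all-NVFP4 model."""
    bits_per_weight = 4 + scale_bits / block_size  # 4.5 bits with the defaults
    return num_params * bits_per_weight / 8 / GIB


total_params = 35e9  # assumption: about 35B total parameters, per the model name
weights_gib = nvfp4_weight_gib(total_params)
fp16_gib = total_params * 16 / 8 / GIB  # same weights at 16 bits, for comparison

print(f"NVFP4 weights: ~{weights_gib:.1f} GiB, FP16 weights: ~{fp16_gib:.1f} GiB")
```

Under these assumptions the weights alone take well under 32GB, while the KV cache, the operating system, and other applications need the remaining headroom.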

Key insights

Ollama 0.19 integrates Apple's MLX and NVIDIA's NVFP4 for faster, more efficient local LLM inference on Apple Silicon.

Principles

Unified memory lets the CPU and GPU share one memory pool, which benefits ML workloads.
Low-precision formats like NVFP4 balance accuracy and efficiency.
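The trade-off behind low-precision formats can be seen in a toy block quantizer. NVFP4 stores each weight as a 4-bit float (E2M1) and shares one scale across a block of 16 values. The sketch below is a simplification and not NVIDIA's or Ollama's implementation: it keeps the block scale as an ordinary Python float instead of an 8-bit float, omits the per-tensor scale, and uses made-up weights.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 values


def quantize_block(block):
    """Return (scale, codes), where codes are signed E2M1 magnitudes."""
    peak = max(abs(x) for x in block)
    scale = peak / 6.0 if peak > 0 else 1.0  # map the largest value to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the scale and the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.02 * (i - 8) for i in range(BLOCK_SIZE)]  # made-up example weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"scale={scale:.4f}, max reconstruction error={max_error:.4f}")
```

Each weight costs 4 bits plus a small share of the block scale, and the per-block scale keeps the rounding error proportional to the largest value in that block.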

Method

Ollama 0.19 uses MLX for Apple Silicon acceleration and NVFP4 for efficient quantization, coupled with an intelligent caching system that reuses cache across conversations and stores checkpoints.
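The idea of reusing cache across conversations can be illustrated with a toy prefix cache. This is a conceptual sketch only, not Ollama's implementation: the `PrefixCache` class and its methods are hypothetical, and it tracks token sequences rather than the attention state a real engine would store.

```python
class PrefixCache:
    """Toy prompt cache that remembers token sequences already processed."""

    def __init__(self):
        self._checkpoints = []  # token tuples stored after each request

    def longest_shared_prefix(self, tokens):
        """Length of the longest prefix shared with any stored checkpoint."""
        best = 0
        for cached in self._checkpoints:
            shared = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                shared += 1
            best = max(best, shared)
        return best

    def process(self, tokens):
        """Return how many tokens still need prefill, then store a checkpoint."""
        reused = self.longest_shared_prefix(tokens)
        self._checkpoints.append(tuple(tokens))
        return len(tokens) - reused


cache = PrefixCache()
system_prompt = list(range(100))  # stand-in for a shared 100-token system prompt

first = cache.process(system_prompt + [1001, 1002])         # nothing cached yet
second = cache.process(system_prompt + [2001, 2002, 2003])  # new conversation

print(f"tokens to prefill: first={first}, second={second}")
```

The second conversation shares the system prompt with the first, so only its new tokens need processing. Agents and coding tools that resend a long, stable prefix on every request benefit most from this pattern.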

In practice

Run Qwen3.5-35B-A3B for coding tasks.
Use NVFP4 so that local results match those from production inference providers serving the same format.
Requires Mac with >32GB unified memory.
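The items above can be tried from a script by calling a locally running Ollama server over its REST API. In the sketch below, `MODEL_TAG` is an assumed placeholder, not a confirmed tag: check `ollama list` or the Ollama model library for the actual name. The endpoint is Ollama's default local address, and `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL_TAG = "qwen3.5:35b-a3b"  # assumed placeholder tag; verify before use


def build_request(prompt, model=MODEL_TAG):
    """Build a non-streaming generate request for the Ollama REST API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        return json.loads(response.read())["response"]


# Building the request needs no server; calling generate() does.
request = build_request("Write a Python function that reverses a linked list.")
```

With the server running and the model pulled, `generate("...")` returns the model's reply as a string.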

Topics

Ollama
Apple Silicon
MLX Framework
NVFP4 Quantization
Qwen3.5-35B-A3B Model

Code references

NVIDIA/Model-Optimizer

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, AI Hardware Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.