Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI

2026-04-20 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

Apple's MLX framework runs large language models (LLMs) such as Gemma 4 directly on iPhones, iPads, and Macs, with execution optimized for Apple Silicon. The Locally AI chatbot, developed by Adrien Grondin, demonstrates this by letting users run models entirely on-device. Developers can integrate LLMs into iOS, iPadOS, and macOS apps using the MLX Swift LM GitHub repository, which handles model download and execution behind a straightforward API. The MLX ecosystem is also expanding to support omni models that handle audio and visual tasks. Models, often quantized to 4-bit or 8-bit for efficiency, are sourced from the active MLX community on Hugging Face, which hosts thousands of optimized model weights. Performance is robust: Gemma 4 quantized to 4-bit achieves 40 tokens per second on recent iPhones, and even older devices can reach 20 tokens per second.

Key takeaway

For AI engineers developing mobile applications, MLX Swift LM offers a streamlined path to integrate and run LLMs directly on Apple devices. You should prioritize quantized models (4-bit to 8-bit) from the Hugging Face MLX community to ensure optimal performance and device compatibility, enabling robust offline AI capabilities in your apps.

Key insights

The MLX framework optimizes LLM inference on Apple Silicon, enabling efficient on-device AI applications.

Principles

Quantization is crucial for on-device LLM performance.
MLX is optimized for execution on Apple Silicon.
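
To make the quantization principle concrete, the sketch below estimates how much memory a model's weights need at different bit widths. It is back-of-the-envelope arithmetic, not a measurement: the 4-billion-parameter count is a hypothetical figure chosen for illustration, and the estimate ignores quantization metadata (scales and offsets), the KV cache, and activations.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Hypothetical 4-billion-parameter model, chosen only for illustration.
NUM_PARAMS = 4e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Halving the bit width halves the weight footprint, which is why 4-bit variants are the usual starting point on memory-constrained phones.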

Method

Integrate MLX Swift LM into an iOS/macOS app, select a quantized model from the Hugging Face MLX community, and pass its ID to the framework for direct download and execution.
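
The summary does not reproduce the Swift API, so the sketch below uses a stand-in: the Python package mlx-lm, which follows the same pattern of passing a Hugging Face repo ID and letting the library download and run the weights. It is not the MLX Swift LM interface. The helper names quant_bits_from_id and run_prompt are hypothetical, and the "-4bit" style suffix is an assumption about how mlx-community repos are commonly named, not a guarantee.

```python
import re
from typing import Optional


def quant_bits_from_id(model_id: str) -> Optional[int]:
    """Return the bit width encoded in a repo name such as '...-4bit', else None.

    Assumes the '-<N>bit' suffix convention often seen in mlx-community repos.
    """
    match = re.search(r"-(\d+)bit$", model_id)
    return int(match.group(1)) if match else None


def run_prompt(model_id: str, prompt: str, max_tokens: int = 128) -> str:
    """Download (on first use) and run a model by its Hugging Face repo ID."""
    # Imported lazily: mlx-lm is an optional dependency aimed at Apple Silicon Macs.
    from mlx_lm import load, generate

    model, tokenizer = load(model_id)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

On a Mac with mlx-lm installed, calling run_prompt with a real repo ID from the MLX community and a prompt string returns the generated text. Instruction-tuned models generally respond better when the prompt is first formatted with the tokenizer's chat template, a step this sketch leaves out for brevity.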

In practice

Use MLX Swift LM for iOS/macOS LLM integration.
Target 4-bit to 8-bit quantization for mobile.
Explore Hugging Face MLX community for models.
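
The 4-bit to 8-bit guidance reflects a trade-off between size and fidelity. The sketch below shows a generic affine (scale and offset) quantization round trip on one small group of weights and reports the worst-case reconstruction error per bit width. It is a simplified illustration, not MLX's exact scheme, and the weight values are made up.

```python
from typing import List, Tuple


def quantize_group(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Map floats to integers in [0, 2**bits - 1] using one scale and offset."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Reconstruct approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


def max_abs_error(values: List[float], bits: int) -> float:
    """Worst-case reconstruction error for one group at a given bit width."""
    codes, scale, lo = quantize_group(values, bits)
    restored = dequantize_group(codes, scale, lo)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up weights for illustration; real quantization groups are larger.
weights = [-0.42, -0.17, -0.03, 0.0, 0.08, 0.21, 0.35, 0.5]

for bits in (8, 4, 2):
    print(f"{bits}-bit max error: {max_abs_error(weights, bits):.4f}")
```

The error is bounded by half the step size, so each bit removed roughly doubles the worst-case error. That is the intuition behind staying within the 4-bit to 8-bit range instead of going lower.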

Topics

MLX Framework
On-Device LLMs
Gemma 4
Model Quantization
Hugging Face MLX Community

Best for: AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.