Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI
Summary
Apple's MLX framework makes it possible to run large language models (LLMs) such as Gemma 4 directly on iPhones, iPads, and Macs, because it is optimized for Apple Silicon. The Locally AI chatbot, developed by Adrien Grondin, demonstrates this by letting users run models on-device. Developers can integrate LLMs into iOS, iPadOS, and macOS apps using the MLX Swift LM GitHub repository, whose API handles model download and execution. The MLX ecosystem is expanding to support omni-models that handle audio and visual tasks. Models, often quantized to 4-bit or 8-bit for efficiency, come from the MLX community on Hugging Face, which hosts thousands of optimized model weights. Gemma 4 quantized to 4-bit reaches 40 tokens per second on recent iPhones, and older devices can still reach 20 tokens per second.
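To put the throughput figures above in user-facing terms, the short sketch below converts tokens per second into the time needed to generate one reply. The 200-token reply length is an assumption chosen for illustration, not a number from the talk, and the calculation ignores prompt processing and model load time.

```python
# Time to generate a reply at the throughput figures quoted in the summary.
# REPLY_TOKENS is an illustrative assumption, not a figure from the talk.
REPLY_TOKENS = 200


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Return the decode time in seconds, ignoring prompt processing."""
    return tokens / tokens_per_second


for label, rate in (("recent iPhone", 40), ("older device", 20)):
    print(f"{label}: {generation_seconds(REPLY_TOKENS, rate):.1f} s")
# prints:
# recent iPhone: 5.0 s
# older device: 10.0 s
```

Under that assumption, halving the throughput doubles the wait, from about five seconds to about ten for the same reply.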
Key takeaway
For AI engineers developing mobile applications, MLX Swift LM offers a streamlined path to integrate and run LLMs directly on Apple devices. You should prioritize quantized models (4-bit to 8-bit) from the Hugging Face MLX community to ensure optimal performance and device compatibility, enabling robust offline AI capabilities in your apps.
Key insights
The MLX framework optimizes LLM inference on Apple Silicon, enabling efficient on-device AI applications.
Principles
- Quantization is crucial for on-device LLM performance.
- MLX is optimized for execution on Apple Silicon.
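A rough calculation shows why quantization matters on a phone. The sketch below estimates the memory taken by model weights alone at 16-bit, 8-bit, and 4-bit precision. The 2-billion-parameter count is a hypothetical value picked for illustration, and the estimate ignores the per-group scale data that quantized formats store alongside the weights, as well as activation and cache memory.

```python
# Rough weight-memory estimate for a model at different quantization widths.
# PARAMS is a hypothetical parameter count chosen for illustration only.
PARAMS = 2_000_000_000


def weight_gib(params: int, bits_per_weight: float) -> float:
    """Return the approximate size of the raw weights in GiB."""
    total_bytes = params * bits_per_weight / 8
    return total_bytes / 2**30


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gib(PARAMS, bits):.2f} GiB")
# prints:
# 16-bit: 3.73 GiB
#  8-bit: 1.86 GiB
#  4-bit: 0.93 GiB
```

Each halving of the bit width halves the weight footprint, so a 4-bit model needs roughly a quarter of the memory of its 16-bit counterpart.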
Method
Integrate MLX Swift LM into an iOS/macOS app, select a quantized model from the Hugging Face MLX community, and pass its ID to the framework for direct download and execution.
In practice
- Use MLX Swift LM for iOS/macOS LLM integration.
- Target 4-bit to 8-bit quantization for mobile.
- Explore Hugging Face MLX community for models.
Topics
- MLX Framework
- On-Device LLMs
- Gemma 4
- Model Quantization
- Hugging Face MLX Community
Best for: AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.