Introducing the Third Generation of Apple’s Foundation Models
Summary
Apple introduced its third generation of Apple Foundation Models (AFM) on June 8, 2026, a family of five custom-built models developed with Google. These models power Apple Intelligence, which is deeply integrated into Apple's operating systems with privacy at its core, and they run either on-device or via Private Cloud Compute. The family includes two on-device models: AFM 3 Core (a 3-billion-parameter dense model) and AFM 3 Core Advanced (a 20-billion-parameter sparse, multimodal model that activates 1-4 billion parameters). The server-based models are AFM 3 Cloud (the workhorse), ADM 3 Cloud (Image) for image generation and editing, and AFM 3 Cloud Pro (the most capable), which extends Private Cloud Compute to NVIDIA GPUs in Google Cloud. In evaluations against their predecessors, AFM 3 Core's text preference rose to 45.6% and AFM 3 Cloud's to 64.7%. AFM 3 Core Advanced achieved a mean opinion score (MOS) of 4.15 for expressive voices and a 44.7% preference for dictation quality.
Key takeaway
For AI architects and machine learning engineers designing privacy-centric, high-performance AI systems, Apple's third-generation Foundation Models offer a blueprint. Consider sparse activation architectures such as Instruction-Following Pruning (IFP) for on-device scalability: they let large models run on consumer hardware by keeping only the active portion of the model in memory. Also explore secure server-side inference via Private Cloud Compute, potentially extended to hybrid cloud environments with partners like Google and NVIDIA, to balance advanced capabilities with stringent data privacy guarantees.
Key insights
Apple's third-gen foundation models use novel sparse architectures and Private Cloud Compute for enhanced on-device and server AI with privacy.
Principles
- Store the full model in flash and load only the needed experts into DRAM per prompt for on-device scalability.
- Implement architectural refinements such as Parallel-Track Mixture-of-Experts (PT-MoE) for server-side multimodal reasoning.
- Train with diverse data, exclude users' private data, and respect publisher opt-outs.
Method
AFM 3 Core Advanced uses Instruction-Following Pruning (IFP) to store the full model in flash memory (NAND) and selectively load a small subset of "experts" into DRAM per prompt, periodically reselecting them during generation.
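The flash-to-DRAM flow described above can be illustrated with a toy cache. This is a minimal sketch under stated assumptions, not Apple's implementation: `ExpertStore`, `DramCache`, `toy_router`, and `generate` are hypothetical names, the router is a placeholder scoring rule, and the "decode step" is a stand-in that only produces deterministic token ids.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertStore:
    """Hypothetical stand-in for flash (NAND): holds every expert."""
    experts: dict  # expert_id -> weights (toy: a list of floats)
    reads: int = 0  # counts how often we had to go to "flash"

    def load(self, expert_id):
        self.reads += 1
        return self.experts[expert_id]


@dataclass
class DramCache:
    """Hypothetical stand-in for DRAM: holds at most `capacity` experts."""
    store: ExpertStore
    capacity: int
    resident: dict = field(default_factory=dict)

    def select(self, scores):
        """Keep only the `capacity` highest-scoring experts resident."""
        wanted = sorted(scores, key=scores.get, reverse=True)[: self.capacity]
        # Evict experts that are no longer selected.
        for expert_id in list(self.resident):
            if expert_id not in wanted:
                del self.resident[expert_id]
        # Load newly selected experts from the backing store.
        for expert_id in wanted:
            if expert_id not in self.resident:
                self.resident[expert_id] = self.store.load(expert_id)
        return sorted(self.resident)


def toy_router(context, num_experts):
    """Placeholder scoring rule (assumption): count tokens per expert bucket."""
    scores = {e: 0.0 for e in range(num_experts)}
    for token in context:
        scores[token % num_experts] += 1.0
    return scores


def generate(prompt_tokens, steps, cache, num_experts, reselect_every):
    """Select experts once per prompt, then reselect every few tokens."""
    context = list(prompt_tokens)
    cache.select(toy_router(context, num_experts))  # per-prompt selection
    for step in range(1, steps + 1):
        next_token = (sum(context) + step) % 97  # stand-in for a real decode step
        context.append(next_token)
        if step % reselect_every == 0:
            cache.select(toy_router(context, num_experts))  # periodic reselection
    return context
```

In this sketch, `store.reads` grows only when a newly selected expert is missing from the cache, which is the cost a real system would try to keep low by reselecting infrequently.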
In practice
- Use sparse activation architectures to scale large models beyond traditional DRAM limits on consumer hardware.
- Employ quantization-aware training (QAT) to compress models while maintaining high accuracy on the target hardware.
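As a rough illustration of the quantization-aware training idea in the list above, the sketch below trains a tiny linear model whose forward pass uses "fake-quantized" weights while gradient updates are applied to full-precision latent weights (the straight-through estimator). The source does not describe Apple's QAT recipe, so the symmetric uniform quantizer, the bit width, and the names `fake_quantize` and `train_qat` are all assumptions chosen for illustration.

```python
def fake_quantize(weights, bits=4):
    """Snap weights to a symmetric uniform grid, returned as floats.

    Assumption: one shared scale per weight list, derived from the max magnitude.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def train_qat(data, num_weights, bits=4, lr=0.05, epochs=200):
    """Toy QAT loop for y = w . x with squared-error loss.

    The forward pass sees quantized weights; the update goes to the
    full-precision latent weights (straight-through estimator).
    """
    latent = [0.0] * num_weights
    for _ in range(epochs):
        for x, y in data:
            quantized = fake_quantize(latent, bits)
            prediction = sum(q * xi for q, xi in zip(quantized, x))
            error = prediction - y
            # Gradient w.r.t. the quantized weights, applied to the latent ones.
            latent = [w - lr * error * xi for w, xi in zip(latent, x)]
    return latent, fake_quantize(latent, bits)
```

Calling `train_qat(data, num_weights=2)` on a list of `(inputs, target)` pairs returns both the latent weights and their quantized counterparts; only the quantized set would be shipped to the device in a setup like this.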
Topics
- Apple Foundation Models
- On-device AI
- Private Cloud Compute
- Sparse Architectures
- Multimodal AI
- Responsible AI
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.