Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
Summary
This study systematically optimized diffusion model inference on the Apple M3 Ultra (60-core GPU, 512 GB unified memory), with the goal of real-time camera img2img transformation. The researchers ran 10 phases of experiments covering CoreML conversion, quantization, Token Merging, Neural Engine utilization, and compact models. CoreML conversion was the only technique that accelerated the UNet, reducing its inference time by 39%. Quantization, parallel inference, and Neural Engine utilization were ineffective or counterproductive, which the study attributes to the M3 Ultra being compute-bound and to the design of its software stack. The best configuration combined CoreML conversion of the distillation-specialized SDXS-512 model with a 3-thread camera pipeline and reached 22.7 FPS at 512x512 resolution. The results show that optimization strategies effective on NVIDIA GPUs do not transfer directly to Apple Silicon's unified memory architecture.
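To put the two headline numbers in perspective, the short sketch below converts them into a per-frame time budget and a speedup factor. The function names are illustrative only, and the two figures measure different things: 22.7 FPS is the end-to-end pipeline rate, while 39% is the reduction in UNet inference time alone.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional reduction in run time.

    A reduction of 0.39 means the new time is 0.61 of the old time,
    so the speedup is 1 / 0.61.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # End-to-end pipeline rate reported in the summary.
    print(f"{frame_budget_ms(22.7):.2f} ms per frame")
    # UNet-only time reduction reported for CoreML conversion.
    print(f"{speedup_from_reduction(0.39):.2f}x UNet speedup")
```

In other words, 22.7 FPS leaves roughly 44 ms per frame, and a 39% shorter UNet pass corresponds to a speedup of about 1.64x for that stage.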
Key takeaway
Computer vision engineers optimizing diffusion models on Apple Silicon should focus on CoreML conversion and on distillation-specialized models such as SDXS-512. CUDA-centric assumptions, such as the effectiveness of quantization or parallel inference, will likely not hold because of the M3 Ultra's unified memory architecture and software stack limitations. Prioritize co-design of model architecture and hardware, and consider direct Metal Compute Shader programming for future performance gains.
Key insights
Apple Silicon's unified memory architecture calls for different diffusion model optimization strategies than NVIDIA GPUs.
Principles
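- On the M3 Ultra's unified memory architecture, diffusion inference is compute-bound rather than memory-bandwidth-bound, so techniques that mainly reduce memory traffic do not speed it up.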
- Model architecture co-design is critical for hardware optimization feasibility.
- Low-level parallel control APIs are essential for effective parallel inference.
Method
A 3-thread camera pipeline combined with CoreML conversion of distillation-specialized models like SDXS-512 enables real-time img2img transformation on Apple M3 Ultra.
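A minimal sketch of what a 3-thread camera pipeline can look like follows. The summary does not say how the three threads are divided, so the capture / inference / display split, the size-1 queues that drop stale frames, and every name here (`put_latest`, `run_pipeline`, the stub callables) are assumptions for illustration, not the authors' implementation.

```python
import queue
import threading
from typing import Any, Callable

_STOP = object()  # sentinel that tells a downstream stage to exit


def put_latest(q: queue.Queue, item: Any) -> None:
    """Put item into a bounded queue, discarding a stale item if it is full.

    Intended for one producer per queue: real-time stages should process
    the newest frame rather than build up a backlog.
    """
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stale item
            except queue.Empty:
                pass  # the consumer took it first; retry the put


def run_pipeline(
    capture: Callable[[int], Any],
    infer: Callable[[Any], Any],
    display: Callable[[Any], None],
    num_frames: int,
) -> None:
    """Run capture, inference, and display concurrently on three threads."""
    raw_q: queue.Queue = queue.Queue(maxsize=1)
    out_q: queue.Queue = queue.Queue(maxsize=1)

    def capture_loop() -> None:
        for i in range(num_frames):
            put_latest(raw_q, capture(i))
        put_latest(raw_q, _STOP)

    def infer_loop() -> None:
        while True:
            frame = raw_q.get()
            if frame is _STOP:
                put_latest(out_q, _STOP)
                return
            put_latest(out_q, infer(frame))

    def display_loop() -> None:
        while True:
            frame = out_q.get()
            if frame is _STOP:
                return
            display(frame)

    threads = [
        threading.Thread(target=fn)
        for fn in (capture_loop, infer_loop, display_loop)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    shown = []
    # Stubs stand in for the camera, the diffusion model, and the window.
    run_pipeline(lambda i: i, lambda x: x * 2, shown.append, num_frames=100)
    print(f"displayed {len(shown)} of 100 captured frames")
```

The design choice worth noting is that frames may be dropped but never reordered: if inference is slower than capture, the inference thread always sees the most recent frame, which keeps latency bounded at the cost of skipping input frames.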
In practice
- Prioritize CoreML conversion for UNet acceleration on Apple Silicon.
- Use distillation-specialized models (e.g., SDXS-512) for speed and quality.
- Do not rely on quantization for speed on the M3 Ultra; the study found it ineffective or counterproductive.
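As a rough illustration of the first recommendation, the sketch below traces a PyTorch module and converts it with `coremltools`. It is not the study's conversion script: the helper names, the generic input naming, the choice of `CPU_AND_GPU` compute units (picked here because the summary reports no gain from the Neural Engine), and the default latent layout of 4 channels at 1/8 resolution (the usual Stable Diffusion convention) are all assumptions.

```python
from typing import Sequence, Tuple


def latent_shape(
    height: int, width: int, channels: int = 4, vae_factor: int = 8
) -> Tuple[int, int, int, int]:
    """Shape of a batch-1 latent tensor for an image of the given size.

    channels=4 and vae_factor=8 are the usual Stable Diffusion values and
    are assumed here; check them against the model actually being converted.
    """
    if height % vae_factor or width % vae_factor:
        raise ValueError("image size must be divisible by the VAE factor")
    return (1, channels, height // vae_factor, width // vae_factor)


def convert_to_coreml(module, example_inputs: Sequence, out_path: str):
    """Trace a PyTorch module and convert it to a Core ML ML Program.

    torch and coremltools are imported lazily so this file can be loaded
    on machines where they are not installed. out_path should end in
    ".mlpackage", the container format used for ML Programs.
    """
    import torch
    import coremltools as ct

    module = module.eval()
    with torch.no_grad():
        traced = torch.jit.trace(module, tuple(example_inputs))

    mlmodel = ct.convert(
        traced,
        # Hypothetical input names; a real UNet would use descriptive ones.
        inputs=[
            ct.TensorType(name=f"input_{i}", shape=tuple(t.shape))
            for i, t in enumerate(example_inputs)
        ],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_AND_GPU,
    )
    mlmodel.save(out_path)
    return mlmodel


if __name__ == "__main__":
    # Latent shape for a 512x512 frame under the assumed SD conventions.
    print(latent_shape(512, 512))
```

A real UNet takes additional inputs (a timestep and text-encoder states), so `example_inputs` would carry one example tensor per input, each with the shape the model expects.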
Topics
- Diffusion Model Optimization
- Apple M3 Ultra
- CoreML Conversion
- Unified Memory Architecture
- SDXS-512
Best for: Research Scientist, Computer Vision Engineer, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.