Accelerating on-device AI: A look at Arm and Google AI Edge optimization
Summary
Arm and Google AI Edge have optimized on-device AI for multimodal workloads such as image and audio generation. Arm Scalable Matrix Extension 2 (SME2) is built directly into the CPU, which lets the CPU act as a high-performance AI accelerator and achieve up to 5x faster inference for matrix-heavy generative AI workloads. The Google AI Edge stack, including LiteRT, AI Edge Quantizer, and Model Explorer, streamlines development by automatically using Arm SME2 through XNNPACK and Arm KleidiAI. The integration was demonstrated by optimizing Stability AI’s stable-audio-open-small model from PyTorch FP32 to mixed precision (FP16/INT8). The result was over 2x faster audio generation (10 s to 4.3 s on Apple M4, 14 s to 6.6 s on an Arm SME2 Android device) and a 4x reduction in memory usage for the DiT submodel, while maintaining audio quality.
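As a quick check on the "over 2x" figure, the speedups can be recomputed from the timings quoted above. This is only arithmetic on the numbers in the summary; the `speedup` helper is a name introduced here for illustration.

```python
# Timings quoted in the summary: (platform, seconds before, seconds after).
timings = [
    ("Apple M4", 10.0, 4.3),
    ("Arm SME2 Android", 14.0, 6.6),
]


def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster the optimized run is."""
    return before_s / after_s


speedups = {name: speedup(before, after) for name, before, after in timings}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster")
# Apple M4: 2.33x faster
# Arm SME2 Android: 2.12x faster
```

Both platforms come out above 2x, which matches the claim in the summary.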
Key takeaway
AI engineers developing on-device generative AI applications should explore the Google AI Edge stack with Arm SME2. The combination offers a streamlined path to significant performance gains and memory reductions for models like Stable Audio Open Small, while keeping output quality high on CPU-powered mobile devices. Use LiteRT, Model Explorer, and AI Edge Quantizer to convert, optimize, and deploy your models efficiently.
Key insights
Arm SME2 and Google AI Edge enable efficient, high-quality on-device generative AI inference on CPUs.
Principles
- Integrate matrix compute units into CPUs for AI acceleration.
- Automate hardware-specific optimizations via software stacks.
- Quantize models selectively to preserve quality.
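The idea of selective quantization can be sketched in a few lines of plain Python. This is a toy illustration, not the Model Explorer or AI Edge Quantizer workflow: the layer names, weight values, error metric, and the 5% threshold are all assumptions chosen for the example. Each layer is round-tripped through symmetric INT8 quantization, and only layers whose weights survive with small error are marked for INT8.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization. Returns (ints, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    ints = [max(-127, min(127, round(v / scale))) for v in values]
    return ints, scale


def mean_relative_error(values):
    """Average per-element relative error after a quantize/dequantize round trip."""
    ints, scale = quantize_int8(values)
    errors = [abs(v - q * scale) / abs(v) for v, q in zip(values, ints) if v != 0]
    return sum(errors) / len(errors) if errors else 0.0


def choose_precision(layers, max_error=0.05):
    """Mark a layer INT8 only if its round-trip error stays under max_error."""
    return {
        name: "int8" if mean_relative_error(weights) <= max_error else "fp16"
        for name, weights in layers.items()
    }


# Hypothetical layers: one well-behaved, one with a large outlier weight.
layers = {
    "attention_proj": [0.12, -0.08, 0.05, -0.11, 0.09],
    "output_head": [0.001, -0.002, 0.0015, 8.0, -0.001],
}

plan = choose_precision(layers)
print(plan)
```

In this sketch the outlier in `output_head` stretches the quantization scale so far that the small weights collapse to zero, so that layer stays at higher precision while `attention_proj` is quantized.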
Method
Convert PyTorch models to .tflite with LiteRT-Torch, use Model Explorer to identify quantization-safe layers, apply mixed-precision quantization with AI Edge Quantizer, then deploy with LiteRT, which uses XNNPACK and KleidiAI for Arm SME2 acceleration.
In practice
- Use LiteRT-Torch for PyTorch to .tflite conversion.
- Apply Model Explorer to visualize and identify quantization-safe layers.
- Use AI Edge Quantizer to quantize FP32 models to mixed precision (FP16/INT8).
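To see how precision choices translate into memory savings, the weight footprint of a model can be estimated from bytes per parameter. The sketch below uses made-up layer sizes and a made-up precision plan; it does not call AI Edge Quantizer and does not describe the actual Stable Audio Open Small layout. Only the byte widths (4 for FP32, 2 for FP16, 1 for INT8) are fixed facts.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def footprint_bytes(param_counts, precision_plan):
    """Total weight storage for a model under a per-layer precision plan."""
    return sum(
        count * BYTES_PER_PARAM[precision_plan[name]]
        for name, count in param_counts.items()
    )


# Hypothetical parameter counts per layer group.
param_counts = {
    "embed": 1_000_000,
    "transformer_blocks": 8_000_000,
    "output_head": 1_000_000,
}

baseline = {name: "fp32" for name in param_counts}
all_int8 = {name: "int8" for name in param_counts}
# Mixed plan: keep a sensitive layer at FP16, quantize the rest to INT8.
mixed = {"embed": "int8", "transformer_blocks": "int8", "output_head": "fp16"}

base_bytes = footprint_bytes(param_counts, baseline)
int8_reduction = base_bytes / footprint_bytes(param_counts, all_int8)
mixed_reduction = base_bytes / footprint_bytes(param_counts, mixed)

print(f"all INT8: {int8_reduction:.2f}x smaller")
print(f"mixed FP16/INT8: {mixed_reduction:.2f}x smaller")
```

Quantizing every weight from FP32 to INT8 gives exactly a 4x reduction, and a mixed plan lands somewhere between 2x and 4x depending on how many parameters stay at FP16.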
Topics
- Arm Scalable Matrix Extension 2
- Google AI Edge
- On-device AI Acceleration
- Model Quantization
- Generative Audio Models
Code references
- google-ai-edge/LiteRT
- google/XNNPACK
- google-ai-edge/ai-edge-quantizer
- google-ai-edge/litert-torch
- Arm-Examples/ML-examples
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.