Accelerating on-device AI: A look at Arm and Google AI Edge optimization

2026-05-14 · Source: Google Developers Blog - AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Internet of Things (IoT) & Connected Devices · Depth: Intermediate, medium

Summary

Arm and Google AI Edge have optimized on-device AI for multimodal workloads such as image and audio generation, building on Arm Scalable Matrix Extension 2 (SME2), which is integrated directly into the CPU. This lets the CPU act as a high-performance AI accelerator, with up to 5x faster inference for matrix-heavy generative AI workloads. The Google AI Edge stack (LiteRT, AI Edge Quantizer, and Model Explorer) streamlines development by using Arm SME2 automatically through XNNPACK and Arm KleidiAI. The integration was demonstrated by optimizing Stability AI’s stable-audio-open-small model from PyTorch FP32 to mixed precision (FP16/INT8), which yielded over 2x faster audio generation (for example, 10 s to 4.3 s on Apple M4 and 14 s to 6.6 s on an Arm SME2 Android device) and a 4x reduction in memory usage for the DiT submodel, while maintaining audio quality.
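
The "over 2x" figure can be checked directly from the timings quoted above. The short sketch below only divides the reported before and after times; the function name `speedup` is illustrative and not from the article.

```python
# Timings quoted in the summary, in seconds: FP32 baseline vs. optimized model.
timings = {
    "Apple M4": (10.0, 4.3),
    "Arm SME2 Android": (14.0, 6.6),
}


def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return before_s / after_s


for device, (before_s, after_s) in timings.items():
    print(f"{device}: {speedup(before_s, after_s):.2f}x faster")
```

Both ratios come out slightly above 2, in line with the stated result.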

Key takeaway

AI engineers building on-device generative AI applications should explore the Google AI Edge stack with Arm SME2. The combination offers a streamlined path to significant performance gains and memory reductions for models like Stable Audio Open Small, while preserving output quality on CPU-powered mobile devices. Use LiteRT, Model Explorer, and AI Edge Quantizer to convert, optimize, and deploy models efficiently.

Key insights

Arm SME2 and Google AI Edge enable efficient, high-quality on-device generative AI inference on CPUs.

Principles

Integrate matrix compute units into CPUs for AI acceleration.
Automate hardware-specific optimizations via software stacks.
Quantize models selectively to preserve quality.
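
The idea of selective quantization can be illustrated with a toy sketch in plain Python. It is not the article's procedure: the per-element error metric, the 0.05 threshold, and the layer names are all assumptions chosen for illustration. The sketch quantizes each layer's weights to INT8, measures how much information is lost, and keeps a layer in FP16 when the loss is too high.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def mean_relative_error(original, restored):
    """Average of |a - b| / |a| over the nonzero original values."""
    ratios = [abs(a - b) / abs(a) for a, b in zip(original, restored) if a != 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def choose_precision(layers, max_error=0.05):
    """Pick INT8 where quantization is nearly lossless, otherwise keep FP16."""
    plan = {}
    for name, weights in layers.items():
        codes, scale = quantize_int8(weights)
        error = mean_relative_error(weights, dequantize(codes, scale))
        plan[name] = "int8" if error <= max_error else "fp16"
    return plan


# Hypothetical toy layers: one well-behaved, one dominated by a single outlier.
layers = {
    "dense_a": [0.5, -0.4, 0.3, -0.2, 0.1],
    "dense_b": [100.0, 0.01, -0.02, 0.03, -0.01],
}
print(choose_precision(layers))
```

In `dense_b` the outlier stretches the quantization scale so far that every small weight rounds to zero, which is the kind of layer a selective scheme would leave in higher precision. Real pipelines judge sensitivity by output quality rather than a weight-level proxy like this one.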

Method

Convert the PyTorch model to .tflite with LiteRT-Torch, use Model Explorer to identify quantization-safe layers, apply mixed-precision quantization with AI Edge Quantizer, then deploy with LiteRT, which uses XNNPACK and KleidiAI for Arm SME2 acceleration.
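
The conversion step might look like the sketch below. It assumes the converter is importable as `ai_edge_torch` (the name under which the PyTorch converter has been distributed; newer releases may use a different import name) and that it exposes `convert(...)` and `export(...)` as in that package's documented examples. `tflite_path_for` is a hypothetical naming helper, not part of any library.

```python
def convert_to_tflite(model, sample_inputs, out_path):
    """Convert a PyTorch module to a .tflite file and return the output path.

    The import is deferred so this file can be loaded without the converter
    installed; the package name is an assumption to verify against your setup.
    """
    import ai_edge_torch  # assumption: converter package name

    # sample_inputs is a tuple of example tensors used to trace the model.
    edge_model = ai_edge_torch.convert(model.eval(), sample_inputs)
    edge_model.export(out_path)
    return out_path


def tflite_path_for(model_name, precision="fp32"):
    """Hypothetical helper: one output file per model and precision."""
    return f"{model_name}_{precision}.tflite"


# Usage (requires torch and the converter package):
#   import torch
#   model = MyModule()                      # hypothetical torch.nn.Module
#   sample_inputs = (torch.randn(1, 64),)   # shape is illustrative only
#   convert_to_tflite(model, sample_inputs, tflite_path_for("my_module"))
```

Keeping the FP32 export as a separate file gives a baseline to compare against once a quantized variant exists.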

In practice

Use LiteRT-Torch for PyTorch to .tflite conversion.
Use Model Explorer to visualize the graph and identify quantization-safe layers.
Use AI Edge Quantizer to quantize the FP32 model to mixed precision (FP16/INT8).
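
The quantize-and-run steps above can be sketched as follows, under several assumptions. The quantizer calls follow the `ai_edge_quantizer` README pattern, and `dynamic_wi8_afp32` is a stock weight-only INT8 recipe used here as a stand-in, not the mixed FP16/INT8 recipe from the article. The interpreter calls assume the `ai_edge_litert` package. `timed` is a hypothetical helper for comparing latency before and after quantization.

```python
import time


def quantize_weights_int8(float_model_path, quantized_model_path):
    """Apply a stock INT8-weight recipe to a .tflite model (stand-in recipe)."""
    from ai_edge_quantizer import quantizer, recipe  # assumption: package layout

    qt = quantizer.Quantizer(float_model_path)
    qt.load_quantization_recipe(recipe.dynamic_wi8_afp32())
    qt.quantize().export_model(quantized_model_path)
    return quantized_model_path


def run_once(model_path, input_array):
    """Run one inference on a single-input, single-output .tflite model.

    input_array must be a NumPy array matching the model's input shape and dtype.
    """
    from ai_edge_litert.interpreter import Interpreter  # assumption: package name

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_detail = interpreter.get_input_details()[0]
    output_detail = interpreter.get_output_details()[0]
    interpreter.set_tensor(input_detail["index"], input_array)
    interpreter.invoke()
    return interpreter.get_tensor(output_detail["index"])


def timed(fn, *args):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Usage (requires the packages above and existing model files):
#   quantize_weights_int8("model_fp32.tflite", "model_int8.tflite")
#   _, fp32_s = timed(run_once, "model_fp32.tflite", x)
#   _, int8_s = timed(run_once, "model_int8.tflite", x)
```

Timing both variants on the target device is the simplest way to confirm whether quantization pays off for a given model.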

Topics

Arm Scalable Matrix Extension 2
Google AI Edge
On-device AI Acceleration
Model Quantization
Generative Audio Models

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.