Blazing fast on-device GenAI with LiteRT-LM

Source: Google Developers Blog - AI · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems) · Depth: Advanced, medium

Summary

Google AI Edge's LiteRT-LM is an optimized engine for deploying Gemma 4 on-device across platforms including Chrome, ChromeOS, and Pixel Watch. It runs inference on LiteRT and relies on advanced quantization together with XNNPACK and MLDrift kernels to work within memory and compute constraints. Reported decode speeds reach up to 52 tokens/sec on Android GPUs, 56 tokens/sec on iOS, and 76 tokens/sec on a MacBook Pro via WebGPU. Multi-Token Prediction (MTP) is integrated with optimized data interplay and memory locality, yielding up to a 2.2x speedup.

Session management provides seamless user continuity and efficient memory use: the ~2.58GB Gemma 4 E2B model runs with a 607MB physical memory footprint on Apple mobile CPUs. LiteRT-LM also supports agentic workflows through Thinking Mode, constrained decoding for structured output, and native function calling, and it now offers Swift and JavaScript APIs for broader integration.
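To put the quoted figures in perspective, the short sketch below converts decode throughput into per-token latency and compares the physical memory footprint with the model size. It only does arithmetic on the numbers in the summary. The 256-token reply length is a hypothetical value chosen for illustration, the conversion assumes 1GB = 1000MB, and the timing covers decode only (prefill and model loading are ignored).

```python
"""Back-of-the-envelope arithmetic on the figures quoted in the summary."""

# Decode throughput in tokens/sec, as quoted in the summary.
DECODE_TOKENS_PER_SEC = {
    "Android GPU": 52.0,
    "iOS": 56.0,
    "MacBook Pro (WebGPU)": 76.0,
}

MODEL_SIZE_MB = 2580.0      # ~2.58GB Gemma 4 E2B; assumes 1GB = 1000MB
RESIDENT_MEMORY_MB = 607.0  # physical memory footprint on Apple mobile CPUs
REPLY_TOKENS = 256          # hypothetical reply length, chosen for illustration


def ms_per_token(tokens_per_sec: float) -> float:
    """Average decode latency per token, in milliseconds."""
    return 1000.0 / tokens_per_sec


def decode_seconds(tokens_per_sec: float, n_tokens: int) -> float:
    """Decode-only time for n_tokens; prefill and model load are ignored."""
    return n_tokens / tokens_per_sec


def resident_fraction(resident_mb: float, model_mb: float) -> float:
    """Share of the model size that is resident in physical memory."""
    return resident_mb / model_mb


if __name__ == "__main__":
    for platform, rate in DECODE_TOKENS_PER_SEC.items():
        print(
            f"{platform}: {ms_per_token(rate):.1f} ms/token, "
            f"{decode_seconds(rate, REPLY_TOKENS):.1f} s per {REPLY_TOKENS} tokens"
        )
    share = resident_fraction(RESIDENT_MEMORY_MB, MODEL_SIZE_MB)
    print(f"Resident share of model size: {share:.1%}")
```

At 52 tokens/sec this works out to roughly 19 ms per token, and 607MB is a little under a quarter of the ~2.58GB model size.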

Key takeaway

For NLP engineers building on-device GenAI applications, LiteRT-LM offers a way to deploy Gemma 4 with high decode throughput and a small memory footprint. Explore its Swift and JavaScript APIs to extend your applications to iOS and the web, and use features such as Multi-Token Prediction and session management to deliver fast, privacy-preserving user experiences.

Key insights

LiteRT-LM enables high-performance, memory-efficient on-device GenAI with Gemma 4 across diverse hardware and platforms.

Method

LiteRT-LM combines advanced quantization, XNNPACK and MLDrift kernels, Multi-Token Prediction (MTP) with optimized data interplay, and session management to achieve high-performance, memory-efficient on-device LLM inference.
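The summary does not describe how MTP is implemented inside LiteRT-LM. As a rough intuition for why predicting several tokens per step can speed up decoding, here is a toy draft-and-verify loop, assuming the general scheme in which a cheap drafter proposes a few tokens and the main model checks them in one pass. Everything in it is hypothetical: `toy_target`, `toy_drafter`, `draft_and_verify`, and `greedy_decode` are illustrative names, not LiteRT-LM APIs, and the "models" are simple integer functions.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy_decode(target: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def draft_and_verify(
    target: NextToken,
    drafter: NextToken,
    prompt: List[Token],
    n: int,
    draft_len: int,
) -> Tuple[List[Token], int]:
    """Generate n tokens; return them with the number of verification passes."""
    out = list(prompt)
    goal = len(prompt) + n
    passes = 0
    while len(out) < goal:
        # The drafter proposes draft_len tokens autoregressively.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(drafter(out + draft))
        # One verification pass. A real engine would score all draft positions
        # in a single batched forward pass; this toy emulates that with
        # per-position calls but counts them as one pass.
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's correction, stop
                break
            accepted.append(tok)
        else:
            # Every draft matched, so the same pass also yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[len(prompt):goal], passes


if __name__ == "__main__":
    tokens, passes = draft_and_verify(toy_target, toy_drafter, [0], 12, 3)
    print(tokens, "verification passes:", passes)
```

In this toy run the output matches plain greedy decoding token for token while needing fewer main-model passes. Fewer passes is not the same as wall-clock speedup, since the drafter's own cost and memory traffic are not modeled here; the "up to 2.2x" figure in the summary is the source's reported number, not something this sketch reproduces.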

Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.