Blazing fast on-device GenAI with LiteRT-LM
Summary
Google AI Edge's LiteRT-LM is an optimized engine for deploying Gemma 4 on-device across platforms including Chrome, ChromeOS, and Pixel Watch. It runs inference on LiteRT and relies on advanced quantization together with XNNPACK and MLDrift kernels to stay within memory and compute constraints. Reported decode speeds reach up to 52 tokens/sec on Android GPUs, 56 tokens/sec on iOS, and 76 tokens/sec on a MacBook Pro via WebGPU. Multi-Token Prediction (MTP) delivers up to a 2.2x speedup by optimizing data interplay and memory locality. Advanced session management preserves user continuity and keeps memory use low: the ~2.58GB Gemma 4 E2B model runs with a 607MB physical memory footprint on Apple mobile CPUs. LiteRT-LM also supports agentic workflows through Thinking Mode, constrained decoding for structured output, and native function calling, and it now offers Swift and JavaScript APIs for broader integration.
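To put the quoted figures in perspective, the short calculation below converts the decode rates into average per-token latency and compares the 607MB footprint with the ~2.58GB model size. It uses only the numbers from the summary. The helper names are illustrative, and the conversion assumes 1GB = 1000MB.

```python
# Decode rates quoted in the summary (tokens per second, upper bounds).
DECODE_TOKENS_PER_SEC = {
    "Android GPU": 52,
    "iOS": 56,
    "MacBook Pro (WebGPU)": 76,
}

MODEL_SIZE_MB = 2.58 * 1000  # assumption: 1 GB = 1000 MB
RESIDENT_MEMORY_MB = 607     # physical memory footprint from the summary


def ms_per_token(tokens_per_sec: float) -> float:
    """Average decode latency per token, in milliseconds."""
    return 1000.0 / tokens_per_sec


def resident_fraction(resident_mb: float, model_mb: float) -> float:
    """Share of the model size that is resident in physical memory."""
    return resident_mb / model_mb


if __name__ == "__main__":
    for platform, rate in DECODE_TOKENS_PER_SEC.items():
        print(f"{platform}: {ms_per_token(rate):.1f} ms/token")
    share = resident_fraction(RESIDENT_MEMORY_MB, MODEL_SIZE_MB)
    print(f"Resident share of model size: {share:.1%}")
```

At 52 tokens/sec this works out to roughly 19 ms per token, and the resident footprint is a little under a quarter of the model size.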
Key takeaway
For NLP Engineers developing on-device GenAI applications, LiteRT-LM offers a robust framework to deploy Gemma 4 with high performance and efficiency. You should explore its Swift and JavaScript APIs to extend your applications to iOS and web platforms, leveraging features like Multi-Token Prediction and advanced session management to deliver fast, privacy-preserving user experiences while minimizing memory footprint.
Key insights
LiteRT-LM enables high-performance, memory-efficient on-device GenAI with Gemma 4 across diverse hardware and platforms.
Principles
- Optimize for memory locality to reduce data transfer latency.
- Support multi-platform deployment with a single codebase.
- Integrate speculative decoding for significant throughput gains.
Method
LiteRT-LM uses advanced quantization, XNNPACK/MLDrift kernels, and Multi-Token Prediction (MTP) with optimized data interplay and session management to achieve high-performance, memory-efficient on-device LLM inference.
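The idea behind multi-token drafting can be sketched as a generic draft-and-verify loop: a cheap drafter proposes several tokens, and the main model checks them in one pass, keeping the longest matching prefix. This is a minimal sketch under stated assumptions, not LiteRT-LM's implementation. The toy models `target_next` and `draft_next`, the draft length `k`, and the pass counter are all hypothetical, and a real engine would verify the draft in a single batched forward pass rather than by repeated function calls.

```python
from typing import List, Tuple

Token = int


def target_next(seq: List[Token]) -> Token:
    """Toy stand-in for the main model: a deterministic next-token rule."""
    return (seq[-1] * 3 + len(seq)) % 11


def draft_next(seq: List[Token]) -> Token:
    """Toy drafter: matches the target except when len(seq) is a multiple of 5."""
    guess = target_next(seq)
    return (guess + 1) % 11 if len(seq) % 5 == 0 else guess


def greedy_decode(prompt: List[Token], n_new: int) -> Tuple[List[Token], int]:
    """Baseline: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq, n_new


def speculative_decode(
    prompt: List[Token], n_new: int, k: int = 3
) -> Tuple[List[Token], int]:
    """Draft k tokens, then verify them; returns (sequence, target passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1) Draft k tokens with the cheap model.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) Verify. Counted as ONE target pass, because a real engine
        #    scores all k + 1 positions in a single batched forward pass.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target_next(seq + accepted)
            accepted.append(expected)  # the target's token is always kept
            if draft[i] != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token matched, so the pass yields a bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], target_passes


if __name__ == "__main__":
    baseline, base_passes = greedy_decode([1], 20)
    fast, spec_passes = speculative_decode([1], 20)
    assert fast == baseline  # same output as plain greedy decoding
    print(f"target passes: {base_passes} (greedy) vs {spec_passes} (draft-and-verify)")
```

With these toy models the loop produces the same 20 tokens as greedy decoding in 8 target passes instead of 20. The real gain depends on how often the drafter agrees with the main model and on the cost of drafting, which this sketch does not model.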
In practice
- Deploy Gemma 4 E2B with 607MB RAM on Apple mobile CPUs.
- Achieve 2.2x speedup using Multi-Token Prediction (MTP).
- Enforce JSON schemas with constrained decoding for structured output.
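Constrained decoding, as mentioned in the last item above, can be illustrated with a toy token mask: at each step, only tokens that keep the output a prefix of some schema-valid string are eligible, and the highest-scoring eligible token wins. This is a minimal sketch, not LiteRT-LM's API. The mini-schema, vocabulary, and `fake_logits` scores are hypothetical, and production engines typically compile the schema into a grammar or automaton rather than enumerating valid outputs.

```python
import json
from typing import Dict, List

# Hypothetical mini-schema: an object {"unit": <one of ALLOWED_UNITS>}.
ALLOWED_UNITS = ["celsius", "fahrenheit"]
VALID_OUTPUTS = [json.dumps({"unit": u}) for u in ALLOWED_UNITS]

# Toy vocabulary of string pieces (a real tokenizer has many thousands).
VOCAB = ['{"unit": "', "celsius", "fahrenheit", "kelvin", '"}', "Sure! ", "<eos>"]


def fake_logits(text: str) -> Dict[str, float]:
    """Stand-in for a model that prefers chatty or schema-invalid tokens."""
    return {
        "Sure! ": 3.0,
        "kelvin": 2.5,
        "fahrenheit": 2.0,
        "celsius": 1.0,
        '{"unit": "': 0.5,
        '"}': 0.4,
        "<eos>": 0.1,
    }


def allowed_tokens(text: str) -> List[str]:
    """Tokens that keep `text` a prefix of at least one valid output."""
    allowed = []
    for tok in VOCAB:
        if tok == "<eos>":
            if text in VALID_OUTPUTS:  # stopping is legal only when complete
                allowed.append(tok)
        elif any(v.startswith(text + tok) for v in VALID_OUTPUTS):
            allowed.append(tok)
    return allowed


def constrained_decode(max_steps: int = 10) -> str:
    """Greedy decoding restricted to schema-compatible tokens."""
    text = ""
    for _ in range(max_steps):
        logits = fake_logits(text)
        candidates = allowed_tokens(text)
        best = max(candidates, key=lambda t: logits[t])  # mask, then argmax
        if best == "<eos>":
            break
        text += best
    return text


if __name__ == "__main__":
    output = constrained_decode()
    print(output)
    print(json.loads(output))  # parses because the mask enforced the schema
```

Although the unconstrained scores favor "Sure! " and "kelvin", the mask forces a parseable object with an allowed value, which is the property that makes structured output dependable for function calling.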
Topics
- LiteRT-LM
- Gemma 4
- On-device GenAI
- Multi-Token Prediction
- Edge AI
Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.