Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction
Summary
Google's LiteRT-LM is a new runtime framework that accelerates on-device inference for large language models, delivering up to 2.2x faster decoding for Gemma 4. Launched on June 5, 2026, it provides native support for Gemma 4 Multi-Token Prediction (MTP) drafters and is built on LiteRT, formerly TensorFlow Lite. It targets Android, iOS, and the web, and handles memory and hardware constraints through advanced quantization and accelerated XNNPACK/MLDrift kernels.
The framework runs MTP as speculative decoding. It coordinates data flow between the primary model and the drafter and enforces memory locality by running both on the same hardware IP. In benchmarks, MTP decoding is 1.6x faster for Gemma 4 E2B and 2.2x faster for Gemma 4 E4B, and LiteRT-LM outperforms competing frameworks by 1.8x to 3.7x.
LiteRT-LM also offers robust session management, a small memory footprint (for example, Gemma 4 E2B takes 607MB on Apple mobile CPUs), and agentic capabilities such as constrained decoding and function-calling. It is available on GitHub, with API support expanding to Swift and JavaScript.
Key takeaway
For AI engineers building on-device LLM applications, LiteRT-LM offers a measurable performance advantage. Consider integrating it to get up to 2.2x faster Gemma 4 inference and a smaller memory footprint, especially for mobile and web deployments. The result is a more responsive user experience and support for agentic features such as function-calling directly on edge devices. Evaluate the GitHub resources and the new Swift and JavaScript APIs against your project's needs.
Key insights
LiteRT-LM accelerates on-device Gemma 4 inference by up to 2.2x by combining MTP, memory locality, and an optimized runtime.
Principles
- Co-locate the drafter and the primary model on the same hardware IP.
- Keep the shared KV cache and activations in local memory.
- Speculative decoding reduces data movement.
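A back-of-envelope estimate helps explain why the last principle pays off. The sketch below is not from the original article: it assumes, as a simplification, that each drafted token is accepted independently with probability alpha, in which case a primary-model pass over draft_len drafted tokens yields (1 - alpha^(draft_len + 1)) / (1 - alpha) tokens on average. More tokens per pass means the primary model's weights are read from memory fewer times per generated token. The function name and parameters are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, draft_len: int) -> float:
    """Expected tokens produced per primary-model verification pass.

    Assumption (not from the source): each drafted token is accepted
    independently with probability `alpha`. The pass always yields at
    least one token, because the primary model supplies a correction
    or a bonus token itself.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one bonus token from the primary model.
        return float(draft_len + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^draft_len
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, draft_len=4), 2))
```

The estimate is an upper bound on the speedup, since it ignores the cost of running the drafter and of verifying rejected tokens.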
Method
LiteRT-LM runs MTP as speculative decoding: the MTP drafter proposes several tokens ahead, and the primary Gemma 4 model verifies them. It enforces memory locality by executing both on the same hardware IP, so the shared KV cache and activations stay local.
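The draft-and-verify idea can be sketched as a minimal greedy loop. This is a generic illustration of speculative decoding, not LiteRT-LM's API or implementation: the models are stand-in functions that map a token prefix to the next token, and every name here (speculative_decode, toy_target, toy_drafter) is hypothetical. A real runtime would verify all drafted positions in one batched pass over a shared KV cache; the sketch calls the target per position and counts that as a single pass.

```python
from typing import Callable, List, Tuple

Token = int
# Stand-in for a model: maps a token prefix to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(model: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Plain autoregressive decoding, one model call per token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextTokenFn,
    drafter: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (tokens, verification passes)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < limit:
        # 1. The cheap drafter proposes draft_len tokens.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(drafter(tokens + draft))
        # 2. The target checks the proposals (one logical pass).
        passes += 1
        accepted: List[Token] = []
        for proposed in draft:
            expected = target(tokens + accepted)
            accepted.append(expected)  # the target's token is always kept
            if expected != proposed:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft matched, so the pass also yields one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:limit], passes


def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[Token]) -> Token:
    nxt = (prefix[-1] + 1) % 10
    return 0 if nxt == 5 else nxt  # deliberately wrong after a 4


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_drafter, [3], 12)
    print(out, "verification passes:", passes)
```

Because the target's own token is kept at every position, the output matches plain greedy decoding with the target model; the gain is that several tokens are committed per target pass.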
In practice
- Deploy Gemma 4 models on Android, iOS, and the web.
- Implement agentic capabilities with constrained decoding and function-calling.
- Use LiteRT-LM for efficient on-device LLM inference.
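Constrained decoding, mentioned in the list above, can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not LiteRT-LM's API: it masks the model's scores so that only tokens permitted by a tiny hand-written grammar for a function call can be chosen. The grammar, token strings, and function names are all hypothetical.

```python
from typing import Dict, List, Set


def constrained_greedy_step(logits: Dict[str, float], allowed: Set[str]) -> str:
    """Pick the highest-scoring token among those the grammar allows."""
    candidates = {tok: score for tok, score in logits.items() if tok in allowed}
    if not candidates:
        raise ValueError("grammar allows no token present in the logits")
    return max(candidates, key=lambda tok: candidates[tok])


# Hypothetical grammar for a call of the form: name ( arg )
GRAMMAR: Dict[str, Set[str]] = {
    "start": {"get_weather", "set_alarm"},
    "name": {"("},
    "open": {"'Paris'", "'07:00'"},
    "arg": {")"},
}
TRANSITIONS: Dict[str, str] = {
    "start": "name",
    "name": "open",
    "open": "arg",
    "arg": "done",
}


def decode_call(logits_per_step: List[Dict[str, float]]) -> str:
    """Decode one function call, restricting each step to grammar-legal tokens."""
    state = "start"
    out: List[str] = []
    for logits in logits_per_step:
        if state == "done":
            break
        out.append(constrained_greedy_step(logits, GRAMMAR[state]))
        state = TRANSITIONS[state]
    return "".join(out)


if __name__ == "__main__":
    # The unconstrained argmax would be "hello" at every step.
    steps = [
        {"hello": 5.0, "get_weather": 2.0, "set_alarm": 1.0},
        {"hello": 3.0, "(": 1.0, ")": 2.0},
        {"hello": 4.0, "'Paris'": 0.5, "'07:00'": 0.1},
        {"hello": 9.0, ")": 0.1},
    ]
    print(decode_call(steps))
```

Masking at each step guarantees that the output parses as a valid call even when the model's top-scoring token would not, which is what makes function-calling reliable on small on-device models.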
Topics
- On-device Inference
- Gemma 4
- Multi-Token Prediction
- LiteRT-LM
- Speculative Decoding
- LLM Optimization
- Edge AI
Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.