Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction

Source: InfoQ · Field: Technology & Digital - Artificial Intelligence & Machine Learning, Internet of Things (IoT) & Connected Devices, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

Google's LiteRT-LM is a new runtime framework that accelerates on-device inference for large language models, specifically Gemma 4, with decoding up to 2.2x faster. Launched on June 5, 2026, LiteRT-LM natively supports Gemma 4 Multi-Token Prediction (MTP) drafters and is built on LiteRT, formerly TensorFlow Lite. It is optimized for Android, iOS, and the web, and handles memory and hardware constraints through quantization and accelerated XNNPACK/MLDrift kernels. The framework implements MTP as speculative decoding, coordinating data exchange between the primary model and the drafter and enforcing memory locality by running both on the same hardware IP. Benchmarks show MTP decoding is 1.6x faster for Gemma 4 E2B and 2.2x faster for Gemma 4 E4B, outperforming competing frameworks by 1.8x to 3.7x. LiteRT-LM also provides session management, memory efficiency (for example, Gemma 4 E2B takes 607MB on Apple mobile CPUs), and agentic capabilities such as constrained decoding and function calling. It is available on GitHub, with support expanding to Swift and JavaScript APIs.

Key takeaway

For AI engineers building on-device LLM applications, LiteRT-LM offers a measurable performance advantage: up to 2.2x faster Gemma 4 inference and a reduced memory footprint, which matters most for mobile and web deployments. That translates into more responsive user experiences and makes agentic features such as function calling practical directly on edge devices. Evaluate its GitHub resources and the new Swift and JavaScript APIs against your project's needs.

Key insights

LiteRT-LM accelerates on-device Gemma 4 inference by up to 2.2x by combining MTP, memory locality, and an optimized runtime.

Method

LiteRT-LM implements MTP as speculative decoding and coordinates the data exchange between the primary Gemma 4 model and the MTP drafter. It enforces memory locality by executing both on the same hardware IP, so the shared KV cache and activations are managed locally.
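
To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding in plain Python. It is not LiteRT-LM code and does not use its API: `speculative_decode`, `target_model`, and `draft_model` are hypothetical names, the "models" are toy deterministic functions over integer tokens, and the shared KV cache and same-IP execution described above are not modeled. It only illustrates why a cheap drafter can reduce the number of passes through the primary model without changing the output.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextTokenFn,
    drafter: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The drafter proposes draft_len tokens autoregressively.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(drafter(tokens + draft))

        # 2. The target verifies the draft. A real runtime would score all
        #    draft positions in one batched forward pass; this toy calls the
        #    target per position and counts the whole check as one pass.
        passes += 1
        accepted: List[Token] = []
        for proposed in draft:
            expected = target(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)
            else:
                accepted.append(expected)  # replace the first mismatch
                break
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], passes


# Toy models (assumptions for illustration only).
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 17


def draft_model(ctx: List[Token]) -> Token:
    guess = target_model(ctx)
    # Deliberately wrong whenever the last token is a multiple of 5.
    return (guess + 1) % 17 if ctx[-1] % 5 == 0 else guess


if __name__ == "__main__":
    out, passes = speculative_decode(target_model, draft_model, [1], 20)
    print(out == greedy_decode(target_model, [1], 20), passes)
```

Because every accepted token is checked against the target's own greedy choice, the output matches plain greedy decoding; the gain comes from needing fewer target passes than generated tokens whenever the drafter guesses well.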

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.