Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction

2026-06-26 · Source: The latest research from Google · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Internet of Things (IoT) & Connected Devices · Depth: Advanced, medium

Summary

Google has introduced a new Multi-Token Prediction (MTP) method to accelerate on-device inference for Gemini Nano v3 models on Pixel 9 and 10 series phones. This architecture retrofits a lightweight Transformer head, the MTP head, onto existing "frozen" Gemini Nano v3 models, eliminating the need for separate, memory-heavy drafter models. Unlike traditional speculative decoding, MTP integrates with the main model's final layers, utilizing its hidden states and KV cache in a "zero-copy architecture." This design reduces runtime memory footprint by 130MB per instance and eliminates drafter prefill latency. The approach ensures no degradation in the base model's capabilities or safety alignment, as incorrect drafts are discarded. Experiments show MTP delivers 50% or more speedups on Pixel 9 devices for tasks like AI Notification Summaries and Proofread, while also reducing energy consumption.
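As a rough way to reason about where such speedups come from, the standard speculative-decoding estimate below computes how many tokens one main-model verification pass yields on average. It is a back-of-envelope sketch, not a figure from the article: the per-token acceptance rate `alpha`, the draft length `k`, and the assumption that acceptances are independent are all illustrative, and the source reports none of these values.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model verification pass.

    alpha: assumed probability that each drafted token is accepted
           (hypothetical; treated as independent per token).
    k:     number of tokens drafted before each verification pass.

    The pass emits the accepted prefix plus one token from the main model
    itself (a correction or a bonus token), giving the geometric sum
    1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=3), 3))
```

This counts tokens per verification pass only. Wall-clock speedup is lower, because running the drafting head and verifying several positions at once are not free.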

Key takeaway

For AI Engineers optimizing on-device LLM performance, this MTP approach offers a significant architectural shift. Consider retrofitting lightweight drafting heads onto frozen production models instead of deploying separate drafter models. In Google's experiments this reduced runtime memory by 130MB per instance and delivered speedups of 50% or more on Pixel 9 devices. Because the backbone stays frozen and incorrect drafts are discarded, the base model's capabilities and safety alignment are preserved, which simplifies deployment of efficient on-device AI.

Key insights

Retrofitting Multi-Token Prediction onto frozen LLMs significantly accelerates on-device inference by integrating drafting with the main model's state.

Principles

Integrate drafting directly into the main model.
Utilize existing model states for efficiency.
Frozen backbones ensure capability preservation.

Method

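Attach a lightweight Transformer MTP head to a frozen LLM's final layers. Train only the MTP head to predict future tokens; it drafts from the main model's hidden states and KV cache rather than maintaining its own. The main model then verifies the drafted tokens and discards any that are incorrect.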
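The draft-then-verify loop behind this method can be sketched with toy stand-ins. Everything below is hypothetical and for illustration only: `base_next` plays the frozen main model with greedy decoding, `head_draft` plays the MTP head and is deliberately wrong at some positions, and no real Gemini Nano API is involved. The sketch shows one property the article relies on: because rejected drafts are replaced by the main model's own token, the output matches plain decoding exactly.

```python
from typing import List, Tuple


def base_next(context: List[int]) -> int:
    """Toy frozen main model: deterministic greedy next token."""
    return (context[-1] * 3 + 1) % 7


def head_draft(context: List[int], k: int) -> List[int]:
    """Toy drafting head: proposes k tokens, with injected errors."""
    ctx = list(context)
    drafts = []
    for _ in range(k):
        tok = base_next(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 7  # simulate an incorrect draft
        drafts.append(tok)
        ctx.append(tok)
    return drafts


def generate_with_drafts(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding; returns tokens and verification passes."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(out) < target_len:
        drafts = head_draft(out, k)
        passes += 1  # conceptually one main-model pass scores all k drafts
        ctx = list(out)
        for tok in drafts:
            expected = base_next(ctx)
            if tok != expected:
                ctx.append(expected)  # discard the bad draft, keep the model's token
                break
            ctx.append(tok)
        else:
            ctx.append(base_next(ctx))  # all drafts accepted: one bonus token
        out = ctx[:target_len]
    return out, passes


def generate_baseline(prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-per-pass greedy decoding with the main model."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(base_next(out))
    return out


if __name__ == "__main__":
    tokens, passes = generate_with_drafts([1, 2], n_new=12, k=3)
    print(tokens == generate_baseline([1, 2], 12), passes)
```

In a real system the verification of all drafted positions happens in a single batched forward pass of the main model; the per-token calls to `base_next` here only emulate that result.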

In practice

Accelerate on-device LLM features like summarization.
Reduce memory footprint for mobile AI applications.
Improve battery life for AI-powered phone features.

Topics

Multi-Token Prediction
On-device AI
Gemini Nano
Pixel devices
LLM inference acceleration
Speculative decoding
Zero-copy architecture

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The latest research from Google.