Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction

· Source: The latest research from Google · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Internet of Things (IoT) & Connected Devices · Depth: Advanced, medium

Summary

Google has introduced a new Multi-Token Prediction (MTP) method to accelerate on-device inference for Gemini Nano v3 models on Pixel 9 and 10 series phones. The method retrofits a lightweight Transformer head, the MTP head, onto existing "frozen" Gemini Nano v3 models, eliminating the need for a separate, memory-heavy drafter model. Unlike traditional speculative decoding, the MTP head attaches to the main model's final layers and reuses its hidden states and KV cache in a "zero-copy architecture." This design reduces the runtime memory footprint by 130 MB per instance and eliminates drafter prefill latency. The base model's capabilities and safety alignment do not degrade, because the frozen model verifies the drafted tokens and incorrect drafts are discarded. Experiments show MTP delivers speedups of 50% or more on Pixel 9 devices for tasks like AI Notification Summaries and Proofread, while also reducing energy consumption.
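The claim that discarding incorrect drafts leaves the base model's output unchanged can be illustrated with a minimal draft-and-verify sketch. Everything here is a toy assumption rather than Google's implementation: `base_next` is a hypothetical stand-in for the frozen model's greedy next-token choice, `draft_tokens` is a hypothetical stand-in for the MTP head that is deliberately wrong some of the time, and greedy decoding is assumed throughout.

```python
from typing import List, Tuple

VOCAB = 50  # hypothetical toy vocabulary size


def base_next(tokens: List[int]) -> int:
    """Toy stand-in for the frozen base model's greedy next-token choice."""
    return (sum(tokens) * 31 + len(tokens)) % VOCAB


def draft_tokens(tokens: List[int], k: int) -> List[int]:
    """Toy stand-in for the drafting head: proposes k tokens, some wrong on purpose."""
    ctx = list(tokens)
    out = []
    for _ in range(k):
        guess = base_next(ctx)
        if len(ctx) % 3 == 0:  # inject a wrong draft at every third position
            guess = (guess + 1) % VOCAB
        out.append(guess)
        ctx.append(guess)
    return out


def greedy_decode(prompt: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the base model only."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(base_next(tokens))
    return tokens


def speculative_decode(prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens per round, keep the verified prefix, discard the rest."""
    tokens = list(prompt)
    target = len(prompt) + n
    base_passes = 0
    while len(tokens) < target:
        draft = draft_tokens(tokens, k)
        # A real system would score all k draft positions in one base-model
        # forward pass; this toy counts one pass per round and checks in a loop.
        base_passes += 1
        for guess in draft:
            correct = base_next(tokens)
            tokens.append(correct)  # the emitted token is always the base model's
            if guess != correct or len(tokens) >= target:
                break  # first mismatch: remaining drafts are discarded
        else:
            # Every draft was accepted: the same pass also yields one extra token.
            tokens.append(base_next(tokens))
    return tokens[:target], base_passes


spec_out, passes = speculative_decode([1, 2, 3], 20)
plain_out = greedy_decode([1, 2, 3], 20)
print(spec_out == plain_out, passes)
```

Because every emitted token comes from the base model's own choice, the output matches plain greedy decoding; the drafter only affects how many base-model passes are needed to produce it.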

Key takeaway

For AI engineers optimizing on-device LLM performance, this MTP approach is a notable architectural shift. Consider retrofitting a lightweight drafting head onto a frozen production model instead of shipping a separate drafter. In Google's reported results, this removes the separate drafter, cuts the runtime memory footprint by 130 MB per instance, and speeds up inference by 50% or more on Pixel 9 devices. Because the base model stays frozen, the approach remains backward compatible and preserves the base model's capabilities, which simplifies deploying efficient on-device AI.

Key insights

Retrofitting Multi-Token Prediction onto frozen LLMs significantly accelerates on-device inference by integrating drafting with the main model's state.

Method

Attach a lightweight Transformer MTP head to a frozen LLM's final layers. Train only the MTP head to predict future tokens from the main model's hidden states and KV cache; the frozen main model then verifies the drafted tokens and discards incorrect ones.
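The "train only the head" step can be sketched on a toy problem. This is an illustration under loud assumptions, not the method from the article: the frozen base is reduced to a fixed table of hidden states (`BASE_HIDDEN`, hypothetical), the head is a plain linear softmax layer rather than a Transformer head, and the toy sequence simply counts upward modulo the vocabulary size, so the token two steps ahead of `t` is `(t + 2) % VOCAB`.

```python
import math

VOCAB, HIDDEN = 5, 5  # hypothetical toy sizes

# Frozen base "model": immutable final-layer hidden states, one row per token.
# Tuples are used so that training cannot modify them by accident.
BASE_HIDDEN = tuple(
    tuple(1.0 if i == j else 0.0 for j in range(HIDDEN)) for i in range(VOCAB)
)


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def head_logits(W, h):
    """Linear head: one weight row per output token."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def train_head(steps=300, lr=0.5):
    """SGD on cross-entropy; only the head weights W are ever updated."""
    W = [[0.0] * HIDDEN for _ in range(VOCAB)]
    for step in range(steps):
        tok = step % VOCAB            # cycle through the toy training tokens
        target = (tok + 2) % VOCAB    # the token two steps ahead
        h = BASE_HIDDEN[tok]          # read-only hidden state from the frozen base
        p = softmax(head_logits(W, h))
        for c in range(VOCAB):
            grad = p[c] - (1.0 if c == target else 0.0)
            for j in range(HIDDEN):
                W[c][j] -= lr * grad * h[j]
    return W


def predict_two_ahead(W, tok):
    logits = head_logits(W, BASE_HIDDEN[tok])
    return max(range(VOCAB), key=lambda c: logits[c])


W = train_head()
print([predict_two_ahead(W, t) for t in range(VOCAB)])
```

The design point the sketch tries to capture is that gradients flow only into the head, so the base model's weights, and therefore its standalone behavior, are the same before and after training.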

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The latest research from Google.