Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction
Summary
Google has introduced a new Multi-Token Prediction (MTP) method to accelerate on-device inference for Gemini Nano v3 models on Pixel 9 and 10 series phones. The architecture retrofits a lightweight Transformer head, the MTP head, onto existing "frozen" Gemini Nano v3 models, eliminating the need for separate, memory-heavy drafter models. Unlike traditional speculative decoding, MTP integrates with the main model's final layers and reuses its hidden states and KV cache in a "zero-copy architecture." This design reduces the runtime memory footprint by 130 MB per instance and eliminates drafter prefill latency. Because the base model's weights stay frozen and incorrect drafts are discarded, the approach causes no degradation in the base model's capabilities or safety alignment. Experiments show MTP delivers speedups of 50% or more on Pixel 9 devices for tasks like AI Notification Summaries and Proofread, while also reducing energy consumption.
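The claim that discarded drafts leave the output unchanged can be illustrated with a minimal greedy draft-and-verify loop. This is a toy sketch, not Google's implementation: `toy_target`, `toy_draft`, `greedy_decode`, and `draft_and_verify` are hypothetical names, the "models" are simple arithmetic functions, and the draft length `k=3` is an arbitrary assumption.

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]


def greedy_decode(target_next: NextTokenFn, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target-model step per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def draft_and_verify(target_next: NextTokenFn, draft_next: NextTokenFn,
                     prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    verify_rounds = 0
    while len(seq) < goal:
        # 1. Draft k tokens with the cheap drafter.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Verify. A real system scores all k positions in one batched
        #    target forward pass; this toy emulates that position by position.
        verify_rounds += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # discard the bad draft, keep the target's token
                break
            accepted.append(tok)
        else:
            # Every draft matched, so the same pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], verify_rounds


def toy_target(seq: List[int]) -> int:
    """Hypothetical stand-in for the frozen main model."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except on every 4th position."""
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 4 == 0 else guess


prompt = [1, 2]
baseline = greedy_decode(toy_target, prompt, 12)
fast, verify_rounds = draft_and_verify(toy_target, toy_draft, prompt, 12, k=3)
```

Every token that reaches the output is one the target itself would have chosen, so `fast` is identical to `baseline`; the drafter only affects how many verification rounds are needed.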
Key takeaway
For AI engineers optimizing on-device LLM performance, this MTP approach offers a significant architectural shift. Consider retrofitting lightweight drafting heads onto frozen production models to achieve substantial speedups and memory savings. Doing so eliminates the need for separate drafters, reducing the memory footprint by 130 MB per instance and improving inference speed by 50% or more on mobile devices like Pixel 9. It ensures backward compatibility and preserves base model capabilities, simplifying deployment of efficient on-device AI.
Key insights
Retrofitting Multi-Token Prediction onto frozen LLMs significantly accelerates on-device inference by integrating drafting with the main model's state.
Principles
- Integrate drafting directly into the main model.
- Utilize existing model states for efficiency.
- Frozen backbones ensure capability preservation.
Method
Attach a lightweight Transformer MTP head to a frozen LLM's final layers. Train only the MTP head to predict future tokens, reusing the main model's hidden states and KV cache for drafting. The frozen main model then verifies the drafted tokens and discards incorrect ones.
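The "train only the head" step can be sketched without any framework. In this toy example, which is an assumption-laden illustration rather than the actual training setup, `FrozenBackbone` and `DraftHead` are hypothetical classes, the head is a linear regressor rather than a Transformer, and the data is synthetic. The point is only that gradients update the head's parameters while the backbone's weights are never touched.

```python
import math


class FrozenBackbone:
    """Hypothetical stand-in for the frozen base model: input -> hidden state."""

    def __init__(self, weights):
        self.weights = tuple(weights)  # stored immutably; training never updates these

    def hidden(self, x):
        return [math.tanh(w * x) for w in self.weights]


class DraftHead:
    """Hypothetical lightweight head that reads the backbone's hidden state."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, h):
        return sum(wi * hi for wi, hi in zip(self.w, h)) + self.b


def mse(backbone, head, data):
    errors = [(head.predict(backbone.hidden(x)) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)


def train_head(backbone, head, data, lr=0.1, epochs=200):
    """Plain SGD on squared error; only the head's parameters receive updates."""
    for _ in range(epochs):
        for x, y in data:
            h = backbone.hidden(x)  # forward pass through the frozen backbone
            err = head.predict(h) - y
            # Gradient of 0.5 * err**2 with respect to the head parameters only.
            head.w = [wi - lr * err * hi for wi, hi in zip(head.w, h)]
            head.b -= lr * err


backbone = FrozenBackbone([1.0, 2.0])
head = DraftHead(dim=2)

# Synthetic targets that are a linear function of the backbone's hidden state.
xs = [i / 4 - 1.0 for i in range(9)]
data = [(x, -0.5 * math.tanh(1.0 * x) + 0.9 * math.tanh(2.0 * x) + 0.1) for x in xs]

weights_before = backbone.weights
loss_before = mse(backbone, head, data)
train_head(backbone, head, data)
loss_after = mse(backbone, head, data)
```

Because the backbone is left untouched, whatever it computed before training it still computes afterwards, which is the property the frozen-backbone principle relies on.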
In practice
- Accelerate on-device LLM features like summarization.
- Reduce memory footprint for mobile AI applications.
- Improve battery life for AI-powered phone features.
Topics
- Multi-Token Prediction
- On-device AI
- Gemini Nano
- Pixel devices
- LLM inference acceleration
- Speculative decoding
- Zero-copy architecture
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The latest research from Google.