Google's Gemma 4 AI models get 3x speed boost by predicting future tokens

· Source: AI - Ars Technica · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Emerging Technologies & Innovation · Depth: Intermediate, short

Summary

Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 open models to speed up local AI inference. Gemma 4 is built on the same technology as Gemini but optimized for local execution, and it now ships under the more permissive Apache 2.0 license. MTP is a form of speculative decoding: a lightweight drafter (for example, 74 million parameters in Gemma 4 E2B) proposes several tokens ahead, and the main Gemma model verifies them in parallel. Because token generation on consumer hardware is typically limited by memory bandwidth rather than compute, verifying several tokens in one pass lets the system produce multiple tokens in roughly the time it previously took to generate one. Google reports that MTP can make Gemma models up to three times faster, with gains of 2.8x and 3.1x on Pixel phones for the E2B and E4B models and 2.5x for Gemma 4 31B on Apple M4 silicon, all without quality degradation.
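As a rough illustration of why verifying drafts in parallel yields more than one token per pass, the sketch below estimates the expected number of tokens emitted per verification pass. It relies on a simplifying assumption that is not from the article: each drafted token is accepted independently with the same probability alpha. The values of alpha and k are illustrative, not measured Gemma 4 figures, and the estimate ignores the drafter's own cost, so real speedups would be lower.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    alpha: assumed per-token acceptance probability (hypothetical value).
    k:     number of tokens the drafter proposes per pass.

    Under the independence assumption, the pass emits 1 + alpha + ... + alpha**k
    tokens on average: the accepted draft prefix plus one token from the main model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus one extra token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

With alpha = 0 the function returns 1.0, which is ordinary one-token-at-a-time decoding; higher acceptance rates move the result toward k + 1.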

Key takeaway

For NLP engineers deploying Gemma 4 models on edge or consumer hardware, MTP drafters offer a substantial inference speedup. Google reports up to a 3x performance increase without compromising output quality, achieved by easing the memory bandwidth bottleneck that commonly limits local inference. MTP-enabled models are available through MLX, vLLM, SGLang, or Ollama for use in local AI applications.

Key insights

Multi-Token Prediction (MTP) significantly accelerates local AI inference for Gemma 4 models via speculative decoding without quality loss.

Method

MTP uses a small drafter to propose speculative tokens, which the main model verifies in parallel. In the same pass, the main model accepts the draft tokens that pass verification and generates one additional token of its own.
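The draft-and-verify loop described above can be sketched as follows. This is a minimal greedy variant of speculative decoding under stated assumptions, not Google's MTP implementation: the "models" are hypothetical toy functions that map a token sequence to the next token, verification is written as a loop where a real system would use one batched forward pass, and the names speculative_step, generate, toy_target, and toy_drafter are invented for illustration.

```python
from typing import Callable, List, Tuple

# A "model" here is any function that returns the next token for a sequence.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens it emits."""
    # 1. The cheap drafter proposes k tokens, one after another.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The main model predicts the next token at each of the k + 1 positions.
    #    A real system computes these in a single parallel pass.
    target_preds = [target(context + draft[:i]) for i in range(k + 1)]

    # 3. Accept the longest draft prefix that agrees with the main model.
    accepted: List[int] = []
    for i in range(k):
        if draft[i] != target_preds[i]:
            break
        accepted.append(draft[i])

    # 4. Add the main model's own token: the correction at the first mismatch,
    #    or one extra token if every draft was accepted.
    accepted.append(target_preds[len(accepted)])
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate tokens; return the sequence and the number of verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[:len(prompt) + max_new_tokens], rounds


# Toy stand-ins: the target counts upward modulo 10; the drafter agrees
# except right after a 5, where it guesses wrong.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_drafter(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_drafter, [0], max_new_tokens=12)
    print(tokens, rounds)
```

In this greedy variant the output is identical to what the main model would produce on its own, because every emitted token is either confirmed or supplied by the main model. The drafter only changes how many verification rounds are needed.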

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.