Gemma 4 12B MTP Local Test | Coding, OCR, Visual RAG with llama.cpp

2026-06-14 · Source: Venelin Valkov · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

Google has introduced the Gemma 4 12B MTP model, a unified multimodal model capable of processing images, text, video, and audio. This 12 billion parameter model, which integrates vision and audio inputs directly into its backbone without separate encoders, aims for efficiency and performance nearing the 26 billion parameter Mixture of Experts (MoE) version. Local testing with llama.cpp on an M5 Pro with 48GB unified memory showed the 8-bit quantized version required approximately 19-20GB of memory, exceeding Google's stated 14GB. While it performed well in Visual RAG (Bulgarian fridge test) and receipt/financial document extraction (though with mixed results for tables), it struggled with complex coding tasks (SVG, game generation) and simple logic (car wash test). The model's output length was generally slim.

Key takeaway

For AI engineers evaluating local LLM deployment, if your tasks involve high-quality coding or complex document understanding, stick with larger Gemma 4 or 3.6 models. However, for "whiter tasks" like text-on-RAG or small agentic tool calling, the Gemma 4 12B MTP model might be a viable, memory-efficient option. Prioritize larger models with 4-bit quantization over the 12B 8-bit version for superior overall performance.

Key insights

The Gemma 4 12B MTP model offers multimodal capabilities and efficiency, nearing larger MoE models, but with performance trade-offs.

Principles

Direct multimodal input integration can enhance efficiency.
8-bit quantization can provide most of a model's power.
Smaller LLMs may "overthink" simple tasks, increasing latency.

Method

Run Gemma 4 12B GGUF with llama.cpp server using 8-bit KXL dynamic quantization, temperature 1, and Multi-Token Prediction (MTP) for 2-2.5x inference speed.

In practice

Test Gemma 4 12B for Visual RAG and basic document extraction.
Compare 8-bit 12B performance against 4-bit MoE models.
Utilize MTP for faster local inference with llama.cpp.

Topics

Gemma 4 12B MTP
llama.cpp
Multimodal AI
Model Quantization
Visual RAG
Document Understanding
Local LLM Inference

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Venelin Valkov.