Gemma 4 Local Test | New Open LLM King?

2026-04-03 · Source: Venelin Valkov · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, extended

Summary

The Gemma 4 model, specifically the 26 billion parameter Mixture of Experts version, demonstrates strong performance, achieving 40-43 tokens per second on an M4 machine with 38 GB of RAM. Google's Gemma 4 release includes mobile-first AI variants, with effective 2 billion, effective 4 billion, and a dense 31 billion parameter model, all licensed under Apache 2. Smaller models offer a 128k context window, while larger ones provide 256k. The models are natively trained on 140 languages and boast improved multimodal abilities, particularly image understanding, and are optimized for agentic workflows with native support for function calling and structured JSON output. The article details running Gemma 4 via WAMA CPP, including tests for image identification, HTML generation, software architecture analysis, JSON data extraction from receipts, and chart data extraction, with generally impressive results, though chart data extraction showed some discrepancies compared to other models.

Key takeaway

For AI Engineers evaluating open-source large language models for local deployment, Gemma 4 presents a compelling option. Its Apache 2 license, strong multimodal capabilities (especially image understanding), and native support for agentic workflows and structured JSON output make it highly versatile. Consider testing the 26 billion parameter Mixture of Experts version for performance, but be mindful of hardware requirements, as 8GB of GPU VRAM is insufficient for larger variants.

Key insights

Gemma 4 offers strong multimodal capabilities and agentic workflow support under an Apache 2 license.

Principles

Apache 2 license enables broad use and fine-tuning.
Multimodal models excel in diverse tasks.
Agentic workflows benefit from native function calling.

Method

Run Gemma 4 locally using WAMA CPP, specifying the quantized 8-bit version for efficient inference. Integrate with OpenAI client for local API access, setting the model name and server URL.

In practice

Use Gemma 4 for local image understanding tasks.
Implement Gemma 4 for structured JSON output.
Explore Gemma 4 for agentic workflow development.

Topics

Gemma 4
WAMA CPP
Multimodal AI
Image Understanding
Agentic Workflows

Best for: Machine Learning Engineer, AI Engineer, AI Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Venelin Valkov.