Gemma 4
Summary
Google DeepMind has released Gemma 4, a new family of open-weight, multimodal models under the Apache 2.0 license, designed for reasoning, agentic workflows, and local/edge deployment. The release includes four model sizes: 31B dense, 26B MoE (~4B active parameters), and two edge-optimized models (E4B, E2B) with native multimodal support for text, vision, and audio. Key features include function calling, structured JSON output, and context windows of up to 256K tokens. Early benchmarks position Gemma-4-31B as a top-tier open model, with notable performance in scientific reasoning (85.7% on GPQA Diamond). Major local and serving stacks such as llama.cpp, Ollama, and vLLM supported the release immediately, and early users reported fast local inference, including an anecdotal 300 t/s on an M2 Ultra. Architectural notes highlight hybrid attention, MoE blocks implemented as separate layers, and several efficiency tricks, though some commentators suggest the gains come more from training data than from architecture.
Key takeaway
For CTOs and VPs of Engineering evaluating open-weight AI models for agentic workflows or edge deployment, Gemma 4's Apache 2.0 license, multimodal capabilities, and strong benchmark performance make it a compelling option. Its rapid ecosystem integration and optimized local inference suggest a lower barrier to adoption and faster time-to-market for new applications. Consider prototyping with Gemma 4 for projects that require robust reasoning and on-device execution, especially given its competitive performance against larger models.
Key insights
Gemma 4 offers powerful, open-weight multimodal AI with strong local deployment and agentic capabilities under an Apache 2.0 license.
Principles
- Open-weight models drive rapid ecosystem integration.
- Hybrid architectures balance performance and efficiency.
- Training data quality significantly impacts model capability.
Method
Gemma 4 uses a hybrid attention mechanism, MoE blocks implemented as separate layers, and techniques such as Proportional RoPE for memory optimization, which together enable efficient multimodal processing and long-context handling.
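To make the MoE idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Gemma 4's actual implementation: the number of experts, the value of `k`, the toy experts, and the function names `route_top_k` and `moe_layer` are all assumptions chosen for illustration. The only figures taken from the summary are the 26B total and ~4B active parameter counts.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: each one scales its input by a fixed factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token

selected = route_top_k(router_logits, k=2)
output = moe_layer([1.0, 1.0], experts, router_logits, k=2)

# Rough share of parameters used per token, from the summary's 26B / ~4B figures.
active_fraction = 4 / 26
```

Only two of the four toy experts run for this token, which is the source of the inference savings: with roughly 4B of 26B parameters active, about 15% of the weights are used per token.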
In practice
- Deploy Gemma 4 locally using llama.cpp or Ollama for edge applications.
- Use Gemma 4's function calling and structured JSON output to drive agentic workflows.
- Explore the 26B MoE variant (~4B active parameters) for large-model quality at reduced inference cost.
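As an illustration of the function-calling item above, the following sketch shows how an application might validate and dispatch a tool call that a model has emitted as structured JSON. Everything model-specific here is an assumption: the `{"name": ..., "arguments": ...}` shape, the `get_weather` tool, and the hard-coded `model_output` string are hypothetical stand-ins, not Gemma 4's documented format. In a real setup the string would come from whichever serving stack is hosting the model.

```python
import json

def get_weather(city: str) -> dict:
    """Hypothetical tool; returns canned data instead of calling a real API."""
    return {"city": city, "forecast": "sunny"}

# Registry of tools the model is allowed to call.
TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(raw: str) -> dict:
    """Parse a JSON tool call and run the matching registered tool.

    Returns {"result": ...} on success or {"error": ...} on any failure,
    so the caller can feed the outcome back to the model either way.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}

    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}

    try:
        return {"result": TOOLS[name](**args)}
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected arguments.
        return {"error": f"bad arguments: {exc}"}

# Assumed model output; the exact schema depends on the prompt and serving stack.
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
result = dispatch_tool_call(model_output)
```

Restricting calls to an explicit registry and returning errors as data, rather than raising, keeps a malformed model response from crashing the agent loop.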
Topics
- Gemma 4
- Open-weight AI Models
- Multimodal AI
- AI Agents
- Model Architecture
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AINews.