[AINews] Good Friday
Summary
Google has released Gemma 4, a family of open multimodal models under the Apache 2.0 license, emphasizing its capabilities for reasoning, agentic workflows, multimodality, and on-device use. Available in E2B, E4B, 26B A4B (MoE), and 31B sizes, Gemma 4 supports over 140 languages and features a hybrid attention mechanism for long-context tasks up to 256K tokens. Day-zero ecosystem support was extensive, with integrations across vLLM, llama.cpp, Ollama, Intel hardware, Unsloth, and Hugging Face. Local inference benchmarks show the 26B A4B MoE model achieving 162 tok/s decode on a single RTX 4090 with 19.5 GB VRAM, and even running on devices like a Mac mini M4 with 16 GB RAM. While early benchmarking discourse was positive, some users noted issues with llama.cpp implementation and context handling, and comparisons with Qwen3.5 models showed mixed results.
Key takeaway
For AI engineers and CTOs evaluating new open-source models for local or edge deployments, Gemma 4 is a compelling option because of its Apache 2.0 license, multimodal capabilities, and strong day-zero ecosystem support. Prioritize testing the 26B A4B MoE variant for efficiency on consumer GPUs, and consider how it integrates with existing tools like llama.cpp and Unsloth. Keep in mind the early-stage tokenizer and context-handling issues reported with some local implementations.
Key insights
Gemma 4's open-source release and broad ecosystem support enable powerful, efficient multimodal AI on diverse hardware.
Principles
- Open-source models drive rapid ecosystem integration.
- Harness engineering is critical for agent performance.
- Local inference capability expands AI accessibility.
Method
Self-distillation without correctness filtering can significantly improve coding model performance, as demonstrated by Apple's Simple Self-Distillation (SSD) on Qwen3-30B-Instruct, boosting pass@1 from 42.4% to 55.3% on LiveCodeBench.
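The data-collection step implied by "without correctness filtering" can be sketched as follows. This is a minimal illustration, not Apple's implementation: `build_self_distillation_set`, `sample_fn`, and `stub_sampler` are hypothetical names, and the sampling settings and fine-tuning recipe are not covered here. The only point shown is that every sampled completion is kept as a training pair, with no tests run and nothing discarded for being wrong.

```python
from typing import Callable, Dict, List


def build_self_distillation_set(
    prompts: List[str],
    sample_fn: Callable[[str], str],
    samples_per_prompt: int = 1,
) -> List[Dict[str, str]]:
    """Collect the model's own completions as fine-tuning pairs.

    Assumption: sample_fn wraps a call to the model being trained.
    No unit tests are executed and no sample is filtered for correctness.
    """
    dataset: List[Dict[str, str]] = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = sample_fn(prompt)
            # Keep every sample as-is; this is the "no correctness filter" part.
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset


def stub_sampler(prompt: str) -> str:
    # Hypothetical stand-in for a real model call, so the sketch runs offline.
    return f"def solution():\n    pass  # draft for: {prompt}"


if __name__ == "__main__":
    pairs = build_self_distillation_set(
        ["reverse a string", "sum a list"], stub_sampler, samples_per_prompt=2
    )
    print(len(pairs), "training pairs collected")
```

The resulting list would then feed an ordinary supervised fine-tuning run on the same model. A filtered pipeline would instead execute each completion against tests before appending it.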
In practice
- Run Gemma 4 locally on consumer hardware for agentic workflows.
- Explore Hermes Agent for stable, capable open-source agent harnesses.
- Use .md/.html artifacts and Obsidian for agent context preservation.
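One way to read the last item above is that the agent writes its notes to a Markdown file inside a folder that Obsidian opens as a vault, then reloads the tail of that file at the start of the next session. The sketch below assumes exactly that and nothing more. The function names, file layout, and the `max_chars` cutoff are hypothetical choices for illustration, not part of any agent harness named in the source.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_agent_note(notes_path: Path, title: str, body: str) -> None:
    """Append a timestamped Markdown section to the notes file."""
    notes_path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with notes_path.open("a", encoding="utf-8") as fh:
        fh.write(f"## {title} ({stamp})\n\n{body.strip()}\n\n")


def load_agent_context(notes_path: Path, max_chars: int = 4000) -> str:
    """Return the most recent notes, to prepend to the next agent prompt."""
    if not notes_path.exists():
        return ""
    text = notes_path.read_text(encoding="utf-8")
    # Keep only the tail so the restored context stays within a fixed budget.
    return text[-max_chars:]


if __name__ == "__main__":
    notes = Path("vault") / "agent-notes.md"  # hypothetical vault location
    append_agent_note(notes, "Plan", "Refactor the parser before adding tests.")
    print(load_agent_context(notes))
```

Because the artifact is plain Markdown, it stays readable and editable by a person in Obsidian or any text editor, which makes it easy to correct the agent's notes between sessions.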
Topics
- Gemma 4
- Hermes Agent
- AI Agent Harnesses
- Local LLM Inference
- Claude Emotion Vectors
Best for: CTO, VP of Engineering/Data, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space (www.latent.space).