NEW GEMMA 4 beats GPT-5.4: The A4B Model
Summary
Google has released the Gemma 4 series of open-source models under an Apache 2.0 license, including 2B, 4B, 26B Mixture-of-Experts (MoE), and 31B dense models. This analysis focuses on live tests of the 26B MoE (which activates 3.88B parameters, roughly 4B active) and the 31B dense model, using a complex "elevator puzzle" designed to assess causal reasoning and logical problem-solving without external tools. The 26B MoE consistently showed better self-reflection, strategic planning, and constraint adherence, and ultimately found a valid solution of 10 button presses. The 31B dense model, by contrast, struggled with optimization: it often got stuck in local minima and violated puzzle constraints, producing invalid or suboptimal solutions. On this specific task, the 26B MoE rivaled or exceeded larger proprietary models such as GPT-5.4 (non-X-High).
Key takeaway
AI scientists and machine learning engineers evaluating open-source LLMs for complex logical tasks should prioritize the Gemma 4 26B MoE (4B active) model. Its demonstrated self-correction and strategic planning make it a strong contender for applications that require robust causal reasoning, and it may outperform larger dense models and even some proprietary alternatives on such challenges. Consider its 31B dense counterpart primarily as a foundation for extensive fine-tuning.
Key insights
The Gemma 4 26B MoE model (4B active) excels at complex logical reasoning and self-correction, outperforming its 31B dense counterpart on the elevator puzzle.
Principles
- Smaller MoE models can surpass larger dense models in complex reasoning.
- Self-reflection and iterative checking are crucial for robust problem-solving.
- Pure intelligence testing should exclude external agentic tools.
Method
The "elevator puzzle" assesses an LLM's causal reasoning. The model must find the shortest sequence of button presses, where the buttons are governed by complex mathematical functions, while staying within energy limits and floor caps, and it must do so without external solvers.
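To make the structure of such a puzzle concrete, here is a minimal Python sketch of a reference solver that an evaluator (not the model under test) could use to obtain the ground-truth shortest path. The source does not give the actual button functions, energy budget, or floor cap, so the `BUTTONS` table, the `shortest_presses` function, and all numbers below are hypothetical placeholders. The sketch uses breadth-first search over (floor, remaining energy) states, which finds a sequence with the fewest presses.

```python
from collections import deque

# Hypothetical buttons: name -> (floor transition, energy cost).
# Illustrative placeholders only, not the puzzle used in the original test.
BUTTONS = {
    "A": (lambda f: f + 3, 1),
    "B": (lambda f: f * 2, 2),
    "C": (lambda f: f - 2, 1),
}


def shortest_presses(start, target, max_floor, energy):
    """Breadth-first search over (floor, energy left) states.

    Returns a shortest list of button names that reaches `target`,
    or None if no sequence satisfies the constraints.
    """
    start_state = (start, energy)
    parent = {start_state: None}  # state -> (previous state, button pressed)
    queue = deque([start_state])
    while queue:
        state = queue.popleft()
        floor, left = state
        if floor == target:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while parent[state] is not None:
                state, button = parent[state]
                path.append(button)
            return path[::-1]
        for name, (move, cost) in BUTTONS.items():
            next_floor = move(floor)
            next_state = (next_floor, left - cost)
            if not 0 <= next_floor <= max_floor:
                continue  # floor cap violated
            if left - cost < 0:
                continue  # energy limit violated
            if next_state in parent:
                continue  # already reached with as few or fewer presses
            parent[next_state] = (state, name)
            queue.append(next_state)
    return None
```

With these placeholder buttons, `shortest_presses(1, 10, max_floor=20, energy=10)` returns `['A', 'A', 'A']`, and the same call with `energy=2` returns `None` because no sequence fits the budget.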
In practice
- Prioritize the Gemma 4 26B MoE (4B active) for tasks requiring deep logical reasoning.
- Use the Gemma 4 31B dense model as a base for fine-tuning on domain-specific tasks.
- Design evaluation puzzles with complex constraints to reveal model intelligence.
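When scoring model answers on a constraint puzzle of this kind, it helps to replay each proposed sequence mechanically, since a model may report a solution that silently breaks a rule. The following Python sketch shows one way to do that. The button table, the `check_solution` function, and every number in it are assumptions made for illustration; the source does not specify the puzzle's actual rules.

```python
# Hypothetical puzzle definition: name -> (floor transition, energy cost).
# Illustrative placeholders only, not the puzzle used in the original test.
BUTTONS = {
    "A": (lambda f: f + 3, 1),
    "B": (lambda f: f * 2, 2),
    "C": (lambda f: f - 2, 1),
}


def check_solution(presses, start, target, max_floor, energy):
    """Replay a proposed button sequence and return (ok, reason)."""
    floor, left = start, energy
    for step, name in enumerate(presses, start=1):
        if name not in BUTTONS:
            return False, f"step {step}: unknown button {name!r}"
        move, cost = BUTTONS[name]
        floor, left = move(floor), left - cost
        if not 0 <= floor <= max_floor:
            return False, f"step {step}: floor {floor} is out of range"
        if left < 0:
            return False, f"step {step}: energy exhausted"
    if floor != target:
        return False, f"ended on floor {floor}, not {target}"
    return True, "valid"
```

A checker like this separates invalid answers (a constraint was violated) from valid but suboptimal ones, which can then be compared by length against a known shortest solution.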
Topics
- Gemma 4 Models
- Mixture-of-Experts
- Complex Logic Puzzles
- Causal Reasoning
- LLM Performance Benchmarking
Best for: AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.