The KV-Cache of Small MoEs: Qwen3, Qwen3.5, GLM 4.7 Flash, and Nemotron 3 Nano Compared
Summary
This analysis examines the inference efficiency of four recent open Mixture-of-Experts (MoE) models: Qwen3.5-35B-A3B, GLM-4.7-Flash, Nemotron-3-Nano-30B-A3B, and Qwen3-30B-A3B-Instruct-2507. While all are MoE models, they employ distinct attention architectures to optimize for efficient deployment, particularly by reducing the memory footprint of the KV cache. The KV cache stores attention keys and values for previous tokens, enabling faster autoregressive generation but creating a significant memory bottleneck as context length and concurrency increase. The article details how each model deviates from a standard BF16 transformer with vanilla multi-head attention, providing specific formulas to estimate memory usage for different context lengths and concurrency levels. It compares grouped-query attention (GQA), compressed MLA-style caching, and hybrid designs that use full attention only in some layers.
Key takeaway
For AI architects evaluating open MoE models for large-scale deployment, each model's KV cache strategy is a critical factor, because it directly determines memory consumption and serving cost. If minimizing KV cache memory is the priority, favor models like Nemotron-3-Nano-30B-A3B or GLM-4.7-Flash: they offer substantial savings over conventional GQA approaches, especially at long context lengths and high concurrency.
Key insights
MoE models achieve inference efficiency through diverse attention architectures and KV cache optimization strategies.
Principles
- KV cache size is a primary driver of LLM serving cost.
- Hybrid architectures can significantly reduce KV cache footprint.
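- MLA's memory savings depend on the runtime: they materialize only when the inference engine stores the compressed representation rather than fully expanded keys and values.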
Method
Estimate KV cache memory with formulas that multiply batch size (the number of concurrent sequences), context length, number of layers, number of KV heads, head dimension, and bytes per element, with a factor of two to cover both keys and values. Then adjust the formula for architectural variations such as GQA, MLA, or hybrid designs.
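As a rough illustration of the method, the standard estimate for a GQA or multi-head attention model can be sketched in a few lines of Python. The function name and the configuration values below are illustrative placeholders, not figures from the original article; substitute the values from the model's own configuration file.

```python
def kv_cache_bytes(batch_size, context_len, num_layers,
                   num_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size in bytes for standard (MHA/GQA) attention.

    The leading factor of 2 covers keys and values.
    bytes_per_elem=2 corresponds to BF16/FP16 storage.
    """
    return (2 * batch_size * context_len * num_layers
            * num_kv_heads * head_dim * bytes_per_elem)


# Placeholder configuration (assumed for illustration only):
# 48 layers, 4 KV heads, head dimension 128, BF16 cache.
size = kv_cache_bytes(batch_size=8, context_len=32768,
                      num_layers=48, num_kv_heads=4, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 24.0 GiB under these placeholder values
```

Because every factor enters multiplicatively, doubling either the context length or the number of concurrent sequences doubles the cache, which is why the footprint grows so quickly in serving scenarios.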
In practice
- Qwen3-30B-A3B-Instruct-2507 uses MoE + GQA for efficiency.
- GLM-4.7-Flash employs MLA for compressed KV storage.
- Nemotron-3-Nano-30B-A3B uses a hybrid Mamba-2/Transformer design.
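To make the difference between these three strategies concrete, the per-token cache cost of each can be sketched as follows. This is a simplified model under stated assumptions: the MLA variant assumes the runtime caches one compressed latent vector plus a small positional (RoPE) key per layer, and the hybrid variant assumes only the attention layers hold a cache that grows with context, ignoring the fixed-size state of the Mamba-2 layers. All function names and numeric values are hypothetical placeholders, not the actual configurations of the models listed above.

```python
def gqa_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are stored for every layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def mla_bytes_per_token(num_layers, latent_dim, rope_dim, bytes_per_elem=2):
    # Assumption: one compressed latent plus one RoPE key per layer,
    # shared across heads, instead of full per-head keys and values.
    return num_layers * (latent_dim + rope_dim) * bytes_per_elem


def hybrid_bytes_per_token(num_attention_layers, num_kv_heads, head_dim,
                           bytes_per_elem=2):
    # Assumption: only the attention layers contribute a cache that grows
    # with context; state-space layers keep a fixed-size state (ignored here).
    return gqa_bytes_per_token(num_attention_layers, num_kv_heads,
                               head_dim, bytes_per_elem)


# Placeholder values for illustration only.
gqa = gqa_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128)
mla = mla_bytes_per_token(num_layers=48, latent_dim=512, rope_dim=64)
hybrid = hybrid_bytes_per_token(num_attention_layers=6, num_kv_heads=2,
                                head_dim=128)
for name, value in [("GQA", gqa), ("MLA", mla), ("Hybrid", hybrid)]:
    print(f"{name}: {value / 1024:.0f} KiB per token")
```

Multiplying any of these per-token figures by context length and the number of concurrent sequences gives the total cache size, so a smaller per-token cost translates directly into more sequences or longer contexts on the same hardware.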
Topics
- LLM Inference Efficiency
- Mixture-of-Experts Models
- KV Cache Management
- Attention Architectures
- Memory Optimization
Best for: NLP Engineer, AI Architect, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.