The KV-Cache of Small MoEs: Qwen3, Qwen3.5, GLM 4.7 Flash, and Nemotron 3 Nano Compared

2026-03-18 · Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

This analysis examines the inference efficiency of four recent open Mixture-of-Experts (MoE) models: Qwen3.5-35B-A3B, GLM-4.7-Flash, Nemotron-3-Nano-30B-A3B, and Qwen3-30B-A3B-Instruct-2507. While all are MoE models, they employ distinct attention architectures to optimize for efficient deployment, particularly by reducing the memory footprint of the KV cache. The KV cache stores attention keys and values for previous tokens, enabling faster autoregressive generation but creating a significant memory bottleneck as context length and concurrency increase. The article details how each model deviates from a standard BF16 transformer with vanilla multi-head attention, providing specific formulas to estimate memory usage for different context lengths and concurrency levels. It compares grouped-query attention (GQA), compressed MLA-style caching, and hybrid designs that use full attention only in some layers.

Key takeaway

For AI architects evaluating open MoE models for large-scale deployment, understanding each model's KV cache strategy is critical, because that strategy largely determines memory consumption and serving cost. If minimizing KV cache memory is the priority, favor models like Nemotron-3-Nano-30B-A3B or GLM-4.7-Flash: they offer substantial savings over conventional GQA approaches, especially at high context lengths and concurrency.

Key insights

The four MoE models pursue inference efficiency through different attention designs: grouped-query attention (GQA), compressed MLA-style caching, and hybrid layouts that use full attention only in some layers.

Principles

KV cache size is a primary driver of LLM serving cost.
Hybrid architectures can significantly reduce KV cache footprint.
MLA-based memory savings depend on a runtime optimized for MLA.

Method

Estimate KV cache memory consumption using specific formulas that account for batch size, context length, number of layers, KV heads, and head dimension, adjusted for architectural variations like GQA, MLA, or hybrid designs.
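As a minimal sketch of the baseline estimate for standard multi-head or grouped-query attention, the function below multiplies the factors listed above. The function name, parameter names, and the example configuration are illustrative assumptions, not values taken from the original article or from any specific model card.

```python
def kv_cache_bytes(batch_size, context_len, num_layers, num_kv_heads,
                   head_dim, bytes_per_elem=2):
    """Estimate KV cache size in bytes for standard MHA/GQA attention.

    bytes_per_elem defaults to 2, which corresponds to BF16/FP16 storage.
    """
    # The leading 2 accounts for storing both keys and values.
    return (2 * batch_size * context_len * num_layers
            * num_kv_heads * head_dim * bytes_per_elem)


# Hypothetical configuration, chosen only to illustrate the arithmetic.
example = kv_cache_bytes(batch_size=8, context_len=32768, num_layers=48,
                         num_kv_heads=4, head_dim=128)
print(f"{example / 2**30:.1f} GiB")  # prints "24.0 GiB"
```

Because every factor enters multiplicatively, doubling either the context length or the number of concurrent sequences doubles the estimate, which is why the cache dominates memory at high context and concurrency.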

In practice

Qwen3-30B-A3B-Instruct-2507 uses MoE + GQA for efficiency.
GLM-4.7-Flash employs MLA for compressed KV storage.
Nemotron-3-Nano-30B-A3B uses a hybrid Mamba-2/Transformer design.
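The three strategies named above change different terms of the estimate. The sketch below compares per-token cache cost under simplifying assumptions: the MLA-style variant is assumed to cache one compressed latent plus a small positional key per token per layer, and the hybrid variant is assumed to cache keys and values only in its attention layers, ignoring the fixed-size state of the non-attention layers. All function names and dimensions are hypothetical and are not the published configurations of the models listed.

```python
BF16_BYTES = 2  # bytes per element in BF16


def gqa_bytes_per_token(num_attn_layers, num_kv_heads, head_dim,
                        bytes_per_elem=BF16_BYTES):
    # Keys and values (factor 2) for every KV head in every attention layer.
    return 2 * num_attn_layers * num_kv_heads * head_dim * bytes_per_elem


def mla_bytes_per_token(num_layers, latent_dim, rope_dim,
                        bytes_per_elem=BF16_BYTES):
    # Assumption: one compressed latent vector plus one positional key
    # is cached per token per layer, shared across all heads.
    return num_layers * (latent_dim + rope_dim) * bytes_per_elem


def hybrid_bytes_per_token(num_attn_layers, num_kv_heads, head_dim,
                           bytes_per_elem=BF16_BYTES):
    # Only the attention layers grow with context; the fixed-size state
    # of the remaining layers is ignored in this sketch.
    return gqa_bytes_per_token(num_attn_layers, num_kv_heads, head_dim,
                               bytes_per_elem)


# Hypothetical dimensions, for illustration only.
estimates = {
    "gqa": gqa_bytes_per_token(num_attn_layers=48, num_kv_heads=4, head_dim=128),
    "mla_style": mla_bytes_per_token(num_layers=48, latent_dim=512, rope_dim=64),
    "hybrid": hybrid_bytes_per_token(num_attn_layers=6, num_kv_heads=2, head_dim=128),
}
for name, per_token in estimates.items():
    print(f"{name}: {per_token} bytes per token")
```

Multiplying any of these per-token figures by context length and the number of concurrent sequences gives the total cache estimate for that design.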

Topics

LLM Inference Efficiency
Mixture-of-Experts Models
KV Cache Management
Attention Architectures
Memory Optimization

Best for: NLP Engineer, AI Architect, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.