The KV-Cache of Small MoEs: Qwen3, Qwen3.5, GLM 4.7 Flash, and Nemotron 3 Nano Compared

· Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

This analysis examines the inference efficiency of four recent open Mixture-of-Experts (MoE) models: Qwen3.5-35B-A3B, GLM-4.7-Flash, Nemotron-3-Nano-30B-A3B, and Qwen3-30B-A3B-Instruct-2507. While all are MoE models, they employ distinct attention architectures to optimize for efficient deployment, particularly by reducing the memory footprint of the KV cache. The KV cache stores attention keys and values for previous tokens, enabling faster autoregressive generation but creating a significant memory bottleneck as context length and concurrency increase. The article details how each model deviates from a standard BF16 transformer with vanilla multi-head attention, providing specific formulas to estimate memory usage for different context lengths and concurrency levels. It compares grouped-query attention (GQA), compressed MLA-style caching, and hybrid designs that use full attention only in some layers.

Key takeaway

For AI architects evaluating open MoE models for large-scale deployment, each model's KV cache strategy is a critical factor, because it directly affects memory consumption and serving cost. If minimizing KV cache memory is the priority, favor models like Nemotron-3-Nano-30B-A3B or GLM-4.7-Flash, which offer substantial savings over conventional GQA approaches, especially at high context lengths and concurrency.

Key insights

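The four MoE models pursue inference efficiency through different attention architectures and KV cache strategies: grouped-query attention (GQA), compressed MLA-style caching, and hybrid designs that use full attention in only some layers.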

Principles

Method

Estimate KV cache memory consumption using specific formulas that account for batch size, context length, number of layers, KV heads, and head dimension, adjusted for architectural variations like GQA, MLA, or hybrid designs.
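
The estimate described above can be sketched in a few lines of Python. This is a minimal sketch built on the commonly used KV cache size formula, not the article's exact formulas. The function names, the MLA parameters (latent_dim, rope_dim), and every numeric value used with them are illustrative assumptions rather than the published configurations of the four models. The hybrid variant counts only full-attention layers and ignores any fixed-size state kept by the other layers.

```python
BYTES_BF16 = 2  # BF16 stores each element in 2 bytes


def kv_cache_bytes(batch_size, context_len, num_layers, num_kv_heads,
                   head_dim, bytes_per_elem=BYTES_BF16):
    """MHA/GQA estimate: one key and one value vector per token,
    per layer, per KV head (hence the leading factor of 2)."""
    return (2 * batch_size * context_len * num_layers
            * num_kv_heads * head_dim * bytes_per_elem)


def mla_cache_bytes(batch_size, context_len, num_layers, latent_dim,
                    rope_dim, bytes_per_elem=BYTES_BF16):
    """MLA-style estimate (assumed layout): one compressed latent plus a
    small positional key per token and layer, shared across heads."""
    return (batch_size * context_len * num_layers
            * (latent_dim + rope_dim) * bytes_per_elem)


def hybrid_cache_bytes(batch_size, context_len, num_full_attn_layers,
                       num_kv_heads, head_dim, bytes_per_elem=BYTES_BF16):
    """Hybrid estimate: only full-attention layers grow with context.
    Fixed-size state of the remaining layers is ignored here."""
    return kv_cache_bytes(batch_size, context_len, num_full_attn_layers,
                          num_kv_heads, head_dim, bytes_per_elem)


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical configuration, chosen for illustration only.
gqa = kv_cache_bytes(batch_size=1, context_len=32768, num_layers=48,
                     num_kv_heads=4, head_dim=128)
mla = mla_cache_bytes(batch_size=1, context_len=32768, num_layers=48,
                      latent_dim=512, rope_dim=64)
hybrid = hybrid_cache_bytes(batch_size=1, context_len=32768,
                            num_full_attn_layers=12, num_kv_heads=4,
                            head_dim=128)
```

With these made-up values, the GQA estimate comes to 3 GiB per sequence at a 32,768-token context, the MLA-style estimate to about 1.69 GiB, and the hybrid estimate to 0.75 GiB. All three scale linearly with batch size and context length, which is why differences between the designs widen at high concurrency.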

Best for: NLP Engineer, AI Architect, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.