The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

2026-04-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

KV caching, a common optimization in autoregressive transformer inference, is not numerically equivalent to cache-free computation under standard FP16 precision. The cache-ON and cache-OFF execution paths accumulate floating-point values in different orders, and because FP16 addition is non-associative, the two paths produce deterministically different decoded token sequences. Across LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B evaluated on GSM8K, a 100% token divergence rate was observed under every sampling strategy, including greedy decoding. Cache-ON inference yielded higher accuracy in 8 of 9 conditions. A controlled FP32 falsification experiment reduced divergence by eight orders of magnitude, which is taken as confirmation that FP16 non-associativity is the sole cause. Divergence patterns vary by model architecture: Grouped-Query Attention models show sharp divergence at the first layer, while Gemma exhibits uniform accumulation across layers.
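
The non-associativity behind this result can be reproduced in a few lines. The sketch below is an illustration only, not the paper's experimental code: it emulates FP16 rounding with Python's standard `struct` module (format `e`, IEEE 754 binary16), and the helper names `to_fp16` and `fp16_sum` are hypothetical. It sums the same three numbers in two orders and gets two different FP16 results.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sum(values) -> float:
    """Accumulate left to right, rounding to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


terms = [2048.0, 1.0, 1.0]

# Order A: (2048 + 1) + 1. FP16 spacing at 2048 is 2, so each +1 is rounded away.
forward = fp16_sum(terms)

# Order B: (1 + 1) + 2048. The small terms combine first and survive rounding.
backward = fp16_sum(reversed(terms))

print(forward, backward)  # 2048.0 2050.0
```

Both orders are deterministic, yet they disagree. The summary attributes the cache-ON versus cache-OFF divergence to this same effect, applied to the much longer accumulations inside attention.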

Key takeaway

AI engineers optimizing LLM inference should recognize that FP16 KV caching introduces deterministic numerical divergence from cache-free computation, and that this divergence changes decoded token sequences. Weigh FP16 performance against numerical stability, and consider FP32 for KV cache operations to restore reproducibility and avoid token flips, especially in applications where output consistency is critical.

Key insights

FP16 KV cache inference is fundamentally non-equivalent to recomputation due to non-associativity, causing deterministic token divergence.

Principles

FP16 non-associativity causes deterministic numerical divergence.
KV cache-ON can yield higher accuracy than cache-OFF.
Divergence patterns are architecturally predictable.

In practice

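Use FP32 for KV cache operations where token-level reproducibility matters; in the reported experiments it reduced divergence by eight orders of magnitude.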
Profile layer-wise drift to understand divergence propagation.
Consider cache-ON for potential accuracy benefits.
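
As a rough illustration of the layer-wise profiling suggestion above, the sketch below compares per-layer activations from two execution paths and reports the largest absolute difference at each layer. It is an assumption-laden outline rather than the authors' method: the function name `layerwise_drift`, the choice of maximum absolute difference as the metric, and the toy activation values are all hypothetical. In a real run the inputs would be hidden states captured from the cache-ON and cache-OFF passes.

```python
def layerwise_drift(cached, uncached):
    """Return the max absolute difference per layer between two activation traces.

    Each argument is a list with one entry per layer, and each entry is a flat
    list of floats (hypothetical stand-ins for captured hidden states).
    """
    if len(cached) != len(uncached):
        raise ValueError("both traces must cover the same number of layers")
    drift = []
    for layer_index, (a, b) in enumerate(zip(cached, uncached)):
        if len(a) != len(b):
            raise ValueError(f"layer {layer_index}: activation sizes differ")
        drift.append(max(abs(x - y) for x, y in zip(a, b)))
    return drift


# Toy data only: three "layers" with two activations each.
cache_on = [[1.0, 2.0], [0.5, 0.25], [4.0, 8.0]]
cache_off = [[1.0, 2.0], [0.5, 0.5], [4.5, 8.0]]

drift = layerwise_drift(cache_on, cache_off)
for layer_index, value in enumerate(drift):
    print(f"layer {layer_index}: max |diff| = {value}")
```

Plotting such a per-layer profile is one way to tell a sharp jump at a single layer apart from gradual accumulation across layers, the two patterns the summary describes.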

Topics

KV Caching
FP16 Precision
Autoregressive Inference
Transformer Models
Numerical Instability

Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.