The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
Summary
KV caching, a common optimization in autoregressive transformer inference, is not numerically equivalent to cache-free computation under standard FP16 precision. This study shows that the cache-ON and cache-OFF execution paths accumulate floating-point values in different orders, and because FP16 addition is non-associative, the decoded token sequences diverge deterministically. Across LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B evaluated on GSM8K, the token divergence rate was 100% under every sampling strategy, including greedy decoding. Cache-ON inference yielded higher accuracy in 8 of 9 conditions. A controlled FP32 falsification experiment reduced divergence by eight orders of magnitude and eliminated token flips, confirming FP16 non-associativity as the sole cause. Layer-wise drift profiling showed that Grouped-Query Attention (GQA) models diverge sharply at the first layer, while Gemma's architecture produced uniform accumulation across layers. Activation patching of the residual stream failed to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache.
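The non-associativity behind this result can be reproduced with three numbers. The sketch below is a minimal illustration using NumPy's `float16`, not code from the study; the values are chosen only because FP16 spacing at 2048 makes the effect easy to see.

```python
import numpy as np

# Adjacent FP16 values in [2048, 4096) are 2 apart, so 2048 + 1 is a tie
# that rounds back to 2048 under round-to-nearest-even.
a = np.float16(2048.0)
b = np.float16(1.0)
c = np.float16(1.0)

left_to_right = (a + b) + c  # each +1 is lost to rounding
right_to_left = a + (b + c)  # 1 + 1 = 2 is large enough to survive

print(float(left_to_right), float(right_to_left))  # 2048.0 2050.0
```

The same three operands give two different answers depending only on grouping, which is the property that lets two mathematically identical execution paths disagree.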
Key takeaway
CTOs and VPs of Engineering designing or deploying LLM inference systems should recognize that FP16 KV caching introduces deterministic, architecturally predictable numerical divergence, so cache-ON and cache-OFF paths are not equivalent. If strict numerical equivalence is critical, consider FP32 for KV cache operations despite the performance cost, or explore proposed techniques such as periodic cache refresh or a higher-precision cache. Grouped-Query Attention (GQA) architectures amplify this divergence, which can affect output consistency and system reliability.
Key insights
FP16 KV cache inference deterministically diverges from cache-free computation due to non-associative floating-point accumulation.
Principles
- FP16 non-associativity causes systematic numerical divergence.
- Divergence is architecturally predictable, not random.
- Residual stream interventions cannot correct KV cache divergence.
Method
The study ran five experiments to analyze and localize FP16 KV cache divergence: behavioral characterization, layer-wise drift analysis, FP32 falsification, decision boundary analysis, and activation patching.
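The logic of an FP32 falsification test can be sketched on a toy reduction: run the same sum in two accumulation orders, once in FP16 and once in FP32, and compare the gaps. The helper names `accumulate` and `order_gap` and the input data are assumptions made for illustration; the study's actual experiment compares full model executions, not a scalar sum.

```python
import numpy as np

def accumulate(values, dtype):
    """Sum values left to right, rounding to `dtype` after every addition."""
    total = dtype(0.0)
    for v in values:
        total = dtype(total + dtype(v))
    return float(total)

def order_gap(values, dtype):
    """Absolute difference between forward and reversed accumulation."""
    return abs(accumulate(values, dtype) - accumulate(values[::-1], dtype))

# Toy data (an assumption, not from the study): one large term, then 64 small ones.
values = [2048.0] + [1.0] * 64

gap_fp16 = order_gap(values, np.float16)
gap_fp32 = order_gap(values, np.float32)
print(gap_fp16, gap_fp32)  # 64.0 0.0
```

In this toy case the FP32 gap is exactly zero because every intermediate value is representable. Real workloads would generally leave a small nonzero residual, which is consistent with the reported reduction of eight orders of magnitude rather than exact equality.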
In practice
- Running in FP32 reduces divergence by eight orders of magnitude and eliminates token flips.
- Grouped-Query Attention (GQA) amplifies divergence effects.
- Divergence impact is highest near decision boundaries.
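The decision-boundary point can be illustrated with greedy decoding: a tiny numerical drift changes the chosen token only when the top two logits are nearly tied. The logits, the drift vector, and the helper names `top1_margin` and `flips_under` below are hypothetical and not taken from the study.

```python
import numpy as np

def top1_margin(logits):
    """Gap between the largest and second-largest logit."""
    top2 = np.sort(logits)[-2:]
    return float(top2[1] - top2[0])

def flips_under(logits, perturbation):
    """True if adding `perturbation` changes the greedy (argmax) token."""
    return int(np.argmax(logits)) != int(np.argmax(logits + perturbation))

# Hypothetical values, for illustration only.
drift = np.array([0.0, 0.01, 0.0])          # small numerical drift on token 1
confident = np.array([5.0, 1.0, 0.5])       # wide margin between top two tokens
borderline = np.array([2.000, 1.995, 0.5])  # top two tokens nearly tied

for name, logits in [("confident", confident), ("borderline", borderline)]:
    print(name, round(top1_margin(logits), 3), flips_under(logits, drift))
```

Under these assumed numbers the same drift leaves the confident prediction unchanged and flips the borderline one, so the top-1 margin is a natural quantity to monitor when estimating how exposed a decoding step is.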
Topics
- KV Cache
- FP16 Inference
- Numerical Divergence
- Floating-Point Non-Associativity
- Grouped-Query Attention
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.