The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

KV caching, a common optimization in autoregressive transformer inference, is not numerically equivalent to cache-free computation under standard FP16 precision. The study shows that the cache-ON and cache-OFF execution paths accumulate floating-point values in different orders; because FP16 addition is non-associative, the two paths deterministically produce different decoded token sequences. Across LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B evaluated on GSM8K, the token divergence rate was 100% under every sampling strategy, including greedy decoding. Cache-ON inference yielded higher accuracy in 8 of 9 conditions. A controlled FP32 falsification experiment reduced divergence by eight orders of magnitude and eliminated token flips, which the study takes as confirmation that FP16 non-associativity is the sole cause. Layer-wise drift profiling showed that Grouped-Query Attention (GQA) models diverge sharply at the first layer, while Gemma's architecture produced uniform accumulation across layers. Activation patching of the residual stream failed to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache.
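The non-associativity behind this result can be reproduced in a few lines. The sketch below is illustrative only and is not taken from the study: it emulates FP16 and FP32 rounding with Python's standard `struct` module, and the helper names (`to_fp16`, `to_fp32`, `accumulate`) and the input values are assumptions chosen to make the effect visible. Real attention kernels accumulate far more terms, in orders set by the implementation.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_fn):
    """Sum left to right, rounding to the target precision after every add."""
    total = round_fn(0.0)
    for v in values:
        total = round_fn(total + round_fn(v))
    return total


# Hypothetical values: one large term and two small ones.
values = [2048.0, 1.0, 1.0]

# Same numbers, two accumulation orders.
fp16_forward = accumulate(values, to_fp16)            # large term first
fp16_reverse = accumulate(reversed(values), to_fp16)  # small terms first

fp32_forward = accumulate(values, to_fp32)
fp32_reverse = accumulate(reversed(values), to_fp32)

print("FP16:", fp16_forward, fp16_reverse)  # FP16: 2048.0 2050.0
print("FP32:", fp32_forward, fp32_reverse)  # FP32: 2050.0 2050.0
```

In FP16 the spacing between representable values near 2048 is 2, so adding 1 to 2048 rounds back to 2048, while adding the two small terms first yields an exactly representable 2050. FP32 has enough mantissa bits that both orders agree for these inputs; in general FP32 addition is also non-associative, only at a much smaller scale, which is consistent with the reported reduction rather than a guarantee of bit-exact equality.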

Key takeaway

CTOs and VPs of Engineering designing or deploying LLM inference systems should recognize that FP16 KV caching introduces deterministic, architecturally predictable numerical divergence, so the cache-ON and cache-OFF paths are not equivalent. If strict numerical equivalence is critical, consider using FP32 for KV cache operations despite the performance cost, or explore future techniques such as periodic cache refresh or a high-precision cache. GQA architectures amplify this divergence, which affects output consistency and system reliability.

Key insights

FP16 KV cache inference deterministically diverges from cache-free computation due to non-associative floating-point accumulation.

Method

The study ran five experiments to analyze and localize FP16 KV cache divergence: behavioral characterization, layer drift analysis, FP32 falsification, decision boundary analysis, and activation patching.
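As a rough illustration of the behavioral characterization step, the sketch below compares cache-ON and cache-OFF outputs for the same prompts. This summary does not state how the study computes its token divergence rate, so the definition used here (the fraction of prompts whose two decoded sequences differ anywhere) is an assumption, as are the function names and the token IDs.

```python
from typing import List, Optional, Sequence, Tuple


def first_divergence(cache_on: Sequence[int],
                     cache_off: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(cache_on, cache_off)):
        if a != b:
            return i
    # One sequence is a strict prefix of the other: they diverge where
    # the shorter one ends.
    if len(cache_on) != len(cache_off):
        return min(len(cache_on), len(cache_off))
    return None


def divergence_rate(pairs: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Fraction of (cache-ON, cache-OFF) pairs that differ anywhere."""
    if not pairs:
        return 0.0
    diverged = sum(1 for on, off in pairs
                   if first_divergence(on, off) is not None)
    return diverged / len(pairs)


# Made-up token IDs for three prompts, not data from the study.
pairs = [
    ([1, 2, 3], [1, 2, 3]),   # identical
    ([1, 2, 3], [1, 5, 3]),   # differs at position 1
    ([1, 2], [1, 2, 4]),      # differs in length at position 2
]

positions = [first_divergence(on, off) for on, off in pairs]
rate = divergence_rate(pairs)
print(positions)  # [None, 1, 2]
print(rate)
```

Recording the first divergence position alongside the rate is a deliberate choice: in autoregressive decoding every token after the first flip is conditioned on a different prefix, so the earliest mismatch is the informative one.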

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.