Enabling KV Caching of Shared Prefix for Diffusion Language Models

2026-03-16 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

Bidirectional prefix caching (BiCache) is presented as the first key-value (KV) caching technique for shared prefixes in Diffusion Language Models (DLMs). Existing KV caching methods were designed for autoregressive LLMs; applied to DLMs, they cause accuracy to collapse to near zero, because bidirectional attention makes the KVs of prefix tokens depend on the rest of the sequence, so they change dynamically. BiCache addresses this by dynamically identifying a layer depth up to which KV reuse is safe, based on the fraction of tokens that belong to the shared prefix. It reuses KVs directly in shallow layers and periodically refreshes them in deep layers. Evaluations on LLaDA show that BiCache improves serving throughput by 36.3%–98.3% over existing techniques, with an accuracy difference of only 0–1.8%. It also integrates with Fast-dLLM, where it further boosts throughput by 51.6%–98.3%.

Key takeaway

MLOps engineers optimizing Diffusion Language Model serving should not apply traditional LLM KV caching directly, because it severely degrades accuracy. Consider BiCache's layer-partitioned caching with offline profiling instead. In the reported LLaDA evaluations, this approach raises serving throughput by 36.3%–98.3% while keeping the accuracy difference within 0–1.8%.

Key insights

Shared-prefix KV caching becomes feasible for DLMs when KV reuse is adapted to layer depth and to the shared prefix ratio.

Principles

Shared prefix KVs remain stable in shallow layers.
The depth of this stable shallow region correlates with the shared prefix ratio (the fraction of tokens that belong to the shared prefix).
Deep layer KVs can be reused with periodic refresh.

Method

BiCache uses offline profiling to map shared prefix ratios to safe layer depths, then applies layer-partitioned caching: direct reuse in shallow layers, periodic refresh in deep layers.
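
The decision logic described in the Method paragraph can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `ReusePlan`, `PROFILE`, `safe_depth_for`, and `should_recompute` are hypothetical, the profile values are placeholders rather than measurements, and the assumption that a larger prefix ratio allows a deeper safe depth is made here only for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class ReusePlan:
    """Hypothetical per-request caching plan (not an API from the paper)."""
    safe_depth: int        # layers [0, safe_depth) reuse cached prefix KVs directly
    refresh_interval: int  # deeper layers recompute prefix KVs every N denoising steps


# Hypothetical result of offline profiling: (minimum prefix ratio, safe depth).
# Placeholder numbers for illustration only; the rising trend is an assumption.
PROFILE = [(0.0, 0), (0.25, 4), (0.5, 8), (0.75, 16)]


def safe_depth_for(prefix_ratio: float, profile=PROFILE) -> int:
    """Return the safe depth of the highest profile bucket not above prefix_ratio."""
    if not 0.0 <= prefix_ratio <= 1.0:
        raise ValueError("prefix_ratio must be in [0, 1]")
    thresholds = [threshold for threshold, _ in profile]
    idx = bisect_right(thresholds, prefix_ratio) - 1
    return profile[idx][1]


def should_recompute(layer: int, step: int, plan: ReusePlan) -> bool:
    """Decide whether the prefix KVs of `layer` are recomputed at denoising `step`."""
    if layer < plan.safe_depth:
        return False  # shallow layer: reuse the cached KVs directly
    # Deep layer: refresh at step 0 and then every `refresh_interval` steps.
    return step % plan.refresh_interval == 0


# Example: a request whose tokens are 60% shared prefix.
plan = ReusePlan(safe_depth=safe_depth_for(0.6), refresh_interval=4)
```

In a serving loop, `should_recompute` would be consulted once per layer and denoising step; how the refresh interval is chosen is not stated in the summary and is left as a free parameter here.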

In practice

Profile DLMs once to determine safe KV reuse depths.
Implement layer-aware KV caching for DLM serving.
Use periodic KV refresh for deep layers to maintain accuracy.
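
One way the one-time profiling step above might look is sketched below. The summary does not state which stability criterion BiCache uses, so this example assumes one: a layer counts as safe when the cached prefix KVs and freshly recomputed ones have cosine similarity above a threshold. The functions `cosine_similarity`, `safe_depth`, and `build_profile`, the `min_similarity` parameter, and the flattened-vector input format are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0 if a == b else 0.0
    return dot / norm


def safe_depth(cached_per_layer, fresh_per_layer, min_similarity=0.99):
    """Count leading layers whose cached KVs stay close to recomputed KVs."""
    depth = 0
    for cached, fresh in zip(cached_per_layer, fresh_per_layer):
        if cosine_similarity(cached, fresh) < min_similarity:
            break  # first unstable layer ends the safe shallow region
        depth += 1
    return depth


def build_profile(samples, min_similarity=0.99):
    """Map each prefix ratio to the most conservative safe depth over its samples.

    `samples` maps a prefix ratio to a list of (cached_per_layer, fresh_per_layer)
    pairs, each a list of flattened per-layer KV vectors (assumed input format).
    """
    return sorted(
        (ratio, min(safe_depth(c, f, min_similarity) for c, f in pairs))
        for ratio, pairs in samples.items()
    )


# Toy data: three layers, two-dimensional "KV" vectors.
cached = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
fresh_a = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # diverges at the third layer
fresh_b = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]  # diverges at the first layer
profile = build_profile({0.5: [(cached, fresh_a)],
                         0.25: [(cached, fresh_a), (cached, fresh_b)]})
```

Taking the minimum over samples is a deliberately conservative choice for this sketch; a real profiling run would compare actual model KVs and would likely validate the resulting depths against task accuracy.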

Topics

Diffusion Language Models
KV Caching
LLM Serving Optimization
Bidirectional Attention
Throughput Improvement
LLaDA

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.