Enabling KV Caching of Shared Prefix for Diffusion Language Models
Summary
Bidirectional prefix caching (BiCache) is presented as the first key-value (KV) caching technique for shared prefixes in Diffusion Language Models (DLMs). Existing prefix caching methods are designed for autoregressive LLMs; applied to DLMs, they cause accuracy to collapse to near zero, because bidirectional attention makes a prefix's KVs depend on the tokens that follow it. BiCache addresses this by choosing a safe layer depth for KV reuse based on the fraction of tokens that belong to the shared prefix. It reuses KVs directly in shallow layers and periodically refreshes them in deep layers. Evaluations on LLaDA show that BiCache improves serving throughput by 36.3%–98.3% over existing techniques, with an accuracy difference of only 0–1.8%. It also integrates with Fast-dLLM, further boosting throughput by 51.6%–98.3%.
Key takeaway
For MLOps engineers optimizing Diffusion Language Model serving: directly applying traditional LLM KV caching will severely degrade accuracy. Consider BiCache's layer-partitioned caching with offline profiling instead. The reported throughput gains range from 36.3% to 98.3% with an accuracy difference of at most 1.8%, which makes DLM deployments more efficient without giving up reliability.
Key insights
Shared-prefix KV caching is feasible for DLMs when KV reuse is adapted to layer depth and to the shared prefix ratio.
Principles
- Shared prefix KVs remain stable in shallow layers.
- The depth up to which layers count as shallow (safe for direct reuse) correlates with the shared prefix ratio.
- Deep layer KVs can be reused with periodic refresh.
Method
BiCache uses offline profiling to map shared prefix ratios to safe layer depths, then applies layer-partitioned caching: direct reuse in shallow layers, periodic refresh in deep layers.
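The layer-partitioned decision described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: `DepthProfile`, `plan_step`, the threshold values, and the `refresh_interval` parameter are all hypothetical names and numbers, and the sketch assumes that a larger shared prefix ratio allows a deeper safe layer.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthProfile:
    """Hypothetical offline profile mapping prefix ratios to safe depths.

    thresholds: ascending shared-prefix ratios in [0, 1]
    depths:     number of shallow layers safe for direct reuse once the
                ratio reaches the matching threshold (assumed values)
    """
    thresholds: tuple
    depths: tuple

    def safe_depth(self, prefix_ratio: float) -> int:
        if not 0.0 <= prefix_ratio <= 1.0:
            raise ValueError("prefix_ratio must be in [0, 1]")
        idx = bisect_right(self.thresholds, prefix_ratio) - 1
        # Below the smallest profiled ratio, treat no layer as safe.
        return self.depths[idx] if idx >= 0 else 0


def plan_step(profile, prefix_len, total_len, num_layers, step, refresh_interval):
    """Return one action per layer for the prefix KVs at this denoising step.

    Shallow layers always reuse the cached KVs. Deep layers reuse them too,
    except on refresh steps, where the prefix KVs are recomputed.
    """
    if total_len <= 0 or refresh_interval <= 0:
        raise ValueError("total_len and refresh_interval must be positive")
    safe = min(profile.safe_depth(prefix_len / total_len), num_layers)
    refresh_now = step % refresh_interval == 0
    return [
        "reuse" if layer < safe or not refresh_now else "recompute"
        for layer in range(num_layers)
    ]


if __name__ == "__main__":
    # Made-up profile: ratio >= 0.25 -> 4 layers, >= 0.5 -> 8, >= 0.75 -> 12.
    profile = DepthProfile(thresholds=(0.25, 0.5, 0.75), depths=(4, 8, 12))
    for step in range(5):
        print(step, plan_step(profile, 60, 100, 16, step, refresh_interval=4))
```

With this made-up profile, a request whose shared prefix covers 60 of 100 tokens reuses the first 8 layers at every step and recomputes the remaining 8 only on steps that are multiples of `refresh_interval`. In a real serving stack the returned plan would drive which cached tensors are read and which are overwritten.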
In practice
- Profile DLMs once to determine safe KV reuse depths.
- Implement layer-aware KV caching for DLM serving.
- Use periodic KV refresh for deep layers to maintain accuracy.
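The one-time profiling step in the list above could look roughly like the following. This is a hedged sketch: the summary does not say how the paper measures KV stability, so the cosine-based `cosine_drift` metric, the `tolerance` value, and the function names are assumptions chosen for illustration. The idea is to compare cached prefix KVs against freshly recomputed ones per layer, then count how many leading layers stay within tolerance for each shared prefix ratio.

```python
import math


def cosine_drift(cached, recomputed):
    """Return 1 - cosine similarity between two flattened KV vectors.

    Assumed stability metric: 0 means the cached KVs are unchanged in
    direction, larger values mean more drift.
    """
    dot = sum(a * b for a, b in zip(cached, recomputed))
    norm = math.sqrt(sum(a * a for a in cached)) * math.sqrt(
        sum(b * b for b in recomputed)
    )
    if norm == 0.0:
        raise ValueError("cannot compare zero-norm vectors")
    return 1.0 - dot / norm


def safe_depth_from_drift(per_layer_drift, tolerance):
    """Count the leading layers whose drift stays within the tolerance."""
    depth = 0
    for drift in per_layer_drift:
        if drift > tolerance:
            break
        depth += 1
    return depth


def build_profile(measurements, tolerance=0.05):
    """Map each profiled prefix ratio to its safe reuse depth.

    measurements: {prefix_ratio: [drift at layer 0, drift at layer 1, ...]}
    """
    return {
        ratio: safe_depth_from_drift(drifts, tolerance)
        for ratio, drifts in sorted(measurements.items())
    }


if __name__ == "__main__":
    # Made-up drift measurements for a 4-layer toy model.
    measurements = {
        0.25: [0.01, 0.04, 0.20, 0.30],
        0.75: [0.00, 0.01, 0.02, 0.30],
    }
    print(build_profile(measurements))
```

Because the profile depends only on the model and not on individual requests, it can be computed once offline and looked up at serving time. The tolerance trades throughput against accuracy and would need to be validated on the target benchmark.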
Topics
- Diffusion Language Models
- KV Caching
- LLM Serving Optimization
- Bidirectional Attention
- Throughput Improvement
- LLaDA
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.