Enabling KV Caching of Shared Prefix for Diffusion Language Models

Source: cs.LG updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure) · Depth: Expert, extended

Summary

Bidirectional prefix caching (BiCache) is presented as the first key-value (KV) caching technique for shared prefixes in Diffusion Language Models (DLMs). Existing KV caching methods were designed for autoregressive LLMs; applied to DLMs, they cause accuracy to collapse to near zero, because DLMs' bidirectional attention makes the KVs of prefix tokens change with the rest of the sequence. BiCache addresses this by dynamically identifying a safe layer depth for KV reuse from the fraction of tokens that belong to the shared prefix. It reuses prefix KVs directly in the shallow layers and periodically refreshes them in the deep layers. Evaluations on LLaDA show that BiCache improves serving throughput by 36.3%–98.3% over existing techniques, with an accuracy difference of only 0–1.8%. It also integrates with Fast-dLLM, where it raises throughput by a further 51.6%–98.3%.

Key takeaway

For MLOps engineers optimizing Diffusion Language Model serving: directly applying KV caching designed for autoregressive LLMs will severely degrade accuracy. Consider BiCache's offline profiling and layer-partitioned caching instead. The reported throughput improvements range from 36.3% to 98.3% with an accuracy difference of at most 1.8%, which makes DLM deployments more efficient without sacrificing reliability.

Key insights

Shared-prefix KV caching is feasible for DLMs when reuse is adapted to both layer depth and the shared prefix ratio.
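To make the "prefix ratio" concrete, here is a minimal sketch of how it could be computed for a pair of tokenized requests. It assumes the ratio means shared-prefix length divided by total sequence length, which is one reading of "shared prefix token fraction"; the paper may define it differently. The function names shared_prefix_len and prefix_ratio are hypothetical and are not taken from BiCache.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids that two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_ratio(prefix_len: int, total_len: int) -> float:
    """Assumed definition: shared-prefix tokens / all tokens in the sequence."""
    if total_len <= 0:
        raise ValueError("total_len must be positive")
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    return prefix_len / total_len


# Two requests that share a 3-token system prompt (token ids are made up).
req_a = [101, 7, 7, 42, 43, 44]
req_b = [101, 7, 7, 90, 91, 92]
n_shared = shared_prefix_len(req_a, req_b)
ratio = prefix_ratio(n_shared, len(req_b))
```

Under this reading, a long system prompt followed by a short user query gives a ratio close to 1, and a short shared preamble followed by a long generation gives a ratio close to 0.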

Principles

Method

BiCache uses offline profiling to map shared prefix ratios to safe layer depths, then applies layer-partitioned caching: direct reuse in shallow layers, periodic refresh in deep layers.
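The layer-partitioned scheme described above can be sketched as a lookup followed by a per-layer decision. This is an illustration under stated assumptions, not the authors' implementation: the profile values, the 32-layer model size, the refresh interval of 4 denoising steps, and all names (PROFILE_THRESHOLDS, PROFILE_DEPTHS, safe_depth, LayerPlan, make_plan) are hypothetical placeholders. In particular, the direction in which safe depth varies with prefix ratio is assumed here and would come from the offline profiling run in practice.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Hypothetical offline profile: for a prefix ratio >= threshold[i], prefix KVs
# may be reused directly in layers [0, depth[i]). Values are placeholders.
PROFILE_THRESHOLDS = [0.0, 0.25, 0.5, 0.75]
PROFILE_DEPTHS = [4, 8, 16, 24]


def safe_depth(ratio: float) -> int:
    """Look up the safe layer depth for a given shared prefix ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    idx = bisect_right(PROFILE_THRESHOLDS, ratio) - 1
    return PROFILE_DEPTHS[idx]


@dataclass(frozen=True)
class LayerPlan:
    depth: int             # layers below this index are "shallow"
    num_layers: int
    refresh_interval: int  # deep layers recompute every N denoising steps

    def action(self, layer: int, step: int) -> str:
        """Decide what to do with the prefix KVs of one layer at one step."""
        if not 0 <= layer < self.num_layers:
            raise ValueError("layer out of range")
        if layer < self.depth:
            return "reuse"      # shallow layer: always reuse cached KVs
        if step % self.refresh_interval == 0:
            return "recompute"  # deep layer: periodic refresh
        return "reuse"          # deep layer between refreshes


def make_plan(ratio: float, num_layers: int = 32,
              refresh_interval: int = 4) -> LayerPlan:
    depth = min(safe_depth(ratio), num_layers)
    return LayerPlan(depth, num_layers, refresh_interval)


plan = make_plan(0.6)
schedule = [plan.action(layer=20, step=s) for s in range(6)]
```

With the placeholder profile, a ratio of 0.6 maps to a safe depth of 16, so layer 20 counts as deep and is recomputed only on steps that are multiples of the refresh interval. A real serving system would replace the string actions with cache reads and forward passes over the prefix tokens.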

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.