LearniBridge: Learnable Calibration of Feature Caching for Diffusion Models Acceleration

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

LearniBridge is a learnable calibration mechanism for accelerating Diffusion Transformers (DiTs), whose image and video generation is computationally expensive. Existing feature caching methods speed up inference by reusing intermediate representations, but they accumulate significant error at high acceleration ratios. LearniBridge builds on the observation that the optimal feature corrections lie in a low-rank subspace shared across diverse prompts. It uses lightweight LoRA updates to bridge multiple timesteps, so calibration needs only 3-5 training samples. In the reported experiments it reaches up to 5.87x acceleration on FLUX, 5.75x on HunyuanVideo, and 4.10x on WAN2.1. On WAN2.1 it also improves VBench by 1.28% over the previous state of the art at 4.10x acceleration.
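To make the error accumulation problem concrete, here is a minimal toy sketch of plain feature caching. It is not the paper's setup: `expensive_block`, the scalar state, and the decay dynamics are hypothetical stand-ins for a transformer block inside a denoising loop. The block output is recomputed only every `interval` steps and reused in between, so fewer block calls are traded against a growing deviation from the fully computed trajectory.

```python
def expensive_block(x, t):
    """Hypothetical stand-in for a costly transformer block (toy dynamics)."""
    return -x


def sample(steps, interval, dt=0.05):
    """Run a toy sampling loop, recomputing the block every `interval` steps."""
    x, cached, calls = 1.0, None, 0
    for t in range(steps):
        if t % interval == 0:
            cached = expensive_block(x, t)  # refresh the cache
            calls += 1
        x = x + dt * cached  # otherwise reuse the stale cached output
    return x, calls


reference, ref_calls = sample(steps=40, interval=1)  # full compute, no reuse

errors, call_counts = {}, {}
for interval in (2, 5, 8):
    x, calls = sample(steps=40, interval=interval)
    errors[interval] = abs(x - reference)
    call_counts[interval] = calls
    print(f"interval={interval}: {ref_calls / calls:.1f}x fewer block calls, "
          f"abs error vs. full compute = {errors[interval]:.4f}")
```

In this toy, a longer reuse interval saves more block calls but drifts further from the reference, which is the gap a calibration step is meant to close.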

Key takeaway

For machine learning engineers optimizing Diffusion Transformer inference, LearniBridge targets the main limitation of feature caching: error accumulation at high acceleration ratios. Its learnable calibration delivers large speedups, such as 5.87x on FLUX, and on WAN2.1 it yields a 1.28% VBench gain over the previous state of the art at 4.10x acceleration. Because calibration relies on lightweight LoRA updates and only 3-5 training samples, it is inexpensive to evaluate on your own DiT models.

Key insights

LearniBridge accelerates DiTs by calibrating feature caching with learnable LoRA updates, correcting caching errors from only a few samples so that quality holds at high acceleration ratios.

Method

LearniBridge calibrates feature caching by exploiting a low-rank subspace that the optimal corrections share across diverse prompts. It applies lightweight LoRA updates to bridge multiple timesteps, so effective calibration requires only 3-5 training samples.
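The idea of a shared low-rank correction can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the drift between a cached feature and a freshly computed one is modeled as a rank-r linear map (`W_true`), the names `fit_low_rank` and `calibrate` are hypothetical, and the LoRA-style factors `A` and `B` are obtained in closed form (least squares followed by a truncated SVD) rather than by gradient training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, tokens, n_samples = 32, 4, 64, 4  # hidden size, rank, tokens/sample, samples

# Assumed ground truth: cached-vs-fresh drift is a rank-r linear map that is
# the same for every prompt (the "shared low-rank subspace" in toy form).
W_true = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) * 0.05


def make_sample():
    """Return (cached, fresh) token features for one synthetic prompt."""
    cached = rng.normal(size=(tokens, d))
    fresh = cached + cached @ W_true
    return cached, fresh


def fit_low_rank(samples, rank):
    """Fit factors A (d x rank) and B (rank x d) so that cached @ A @ B ~ drift."""
    H = np.concatenate([c for c, _ in samples])      # stacked cached features
    R = np.concatenate([f - c for c, f in samples])  # stacked residuals
    W, *_ = np.linalg.lstsq(H, R, rcond=None)        # dense least-squares fit
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]         # keep the top-`rank` part


def calibrate(cached, A, B):
    """Apply the low-rank correction to a reused (cached) feature."""
    return cached + (cached @ A) @ B


A, B = fit_low_rank([make_sample() for _ in range(n_samples)], r)

cached, fresh = make_sample()  # held-out prompt, not used for fitting
err_plain = np.linalg.norm(fresh - cached) / np.linalg.norm(fresh)
err_calibrated = (np.linalg.norm(fresh - calibrate(cached, A, B))
                  / np.linalg.norm(fresh))
print(f"relative error, reuse only: {err_plain:.3f}")
print(f"relative error, calibrated: {err_calibrated:.3e}")
```

Because each sample contributes many token rows, a handful of samples is enough to pin down the small factors in this toy setting. Real features are noisier and the true drift need not be exactly linear or exactly low rank, so the near-zero calibrated error here should not be read as a prediction for actual DiT models.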

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.