RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

RefDecoder is a reference-conditioned video VAE decoder that addresses an architectural asymmetry in latent diffusion models: the denoising network is heavily conditioned, yet the decoder usually remains unconditional. This asymmetry typically causes significant detail loss and inconsistency with the input image. RefDecoder injects high-fidelity reference image signals directly into the decoding process through reference attention. A lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. The approach yields consistent improvements: up to +2.1 dB PSNR over unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks, and better subject consistency, background consistency, and overall quality on the VBench I2V benchmark.
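
To put the reported PSNR gain in context, here is a minimal sketch of the standard PSNR definition, 10 · log10(MAX² / MSE), together with the mean-squared-error ratio that a given dB gain implies. The function names, the assumption that pixel values lie in [0, 1], and the toy frames are illustrative and not taken from the paper.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB; assumes values lie in [0, max_value]."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db (same max_value)."""
    return 10.0 ** (-gain_db / 10.0)

# Toy frames (hypothetical data): a constant error of 0.1 gives MSE = 0.01.
clean = np.zeros((4, 4))
noisy = np.full((4, 4), 0.1)
print(round(psnr(clean, noisy), 2))          # 20.0
print(round(mse_ratio_for_gain(2.1), 3))     # 0.617
```

Under this definition, a +2.1 dB gain corresponds to roughly 38% lower mean squared error at the same peak value.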

Key takeaway

For research scientists developing or deploying video generation systems, RefDecoder is a direct, drop-in enhancement. You can integrate this reference-conditioned VAE decoder without fine-tuning existing models and improve detail preservation, subject consistency, and overall video quality. The same approach extends to other visual generation tasks such as style transfer and video editing.

Key insights

Conditional video decoders improve detail preservation and consistency in visual generation.

Principles

Method

RefDecoder uses a lightweight image encoder to map the reference frame into high-dimensional tokens, which are co-processed with the denoised video latents via reference attention at each decoder up-sampling stage.
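
The reference-attention step described above can be sketched as a single attention layer in NumPy. This is an illustration under stated assumptions, not the authors' implementation: the summary does not specify how tokens are co-processed, so the sketch assumes video tokens act as queries over the concatenation of video and reference tokens, with a residual connection. All names, shapes, and the random weights are hypothetical, and only one up-sampling stage is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v, w_o):
    """One hypothetical reference-attention layer.

    video_tokens: (n_video, d) tokens from the denoised video latents.
    ref_tokens:   (n_ref, d) tokens from the reference-image encoder.
    w_q, w_k, w_v, w_o: (d, d) projection matrices.
    """
    # Assumed form of "co-processing": keys and values span both token sets.
    context = np.concatenate([video_tokens, ref_tokens], axis=0)
    q = video_tokens @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual connection: reference detail is added on top of the video features.
    return video_tokens + (weights @ v) @ w_o

rng = np.random.default_rng(0)
d = 16
video_tokens = rng.standard_normal((4 * 8 * 8, d))  # e.g. 4 frames of 8x8 positions
ref_tokens = rng.standard_normal((8 * 8, d))        # one reference frame
w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

out = reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v, w_o)
print(out.shape)  # (256, 16)
```

The output keeps the shape of the video tokens, so a layer of this kind could in principle sit between existing decoder blocks without changing their interfaces. In this sketch, setting the output projection to zero turns the layer into an identity.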

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.