Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

Qi Mao, Zijian Wang, Zhengxue Cheng, Lingyu Zhu, and Siwei Ma introduce Mixture of Decoder Experts (MoDE), a dual-latent collaborative decoding framework for balancing fidelity and perception in learned image compression (LIC). Existing LIC methods rely on a single latent representation, so they often struggle to maintain structural fidelity and perceptual realism at the same time, especially across varying bitrates. MoDE instead treats scalar-quantized (SQ) continuous latents as a fidelity-oriented expert and vector-quantized (VQ) discrete tokens as a perception-oriented expert. Both decoders stay frozen and are coordinated by two learned decoder-side modules: Expert-Specific Enhancement (ESE), which preserves branch-specific references, and Cross-Expert Modulation (CEM), which performs selective complementary transfer between the experts. MoDE supports both fidelity-anchored (MoDE-F) and perception-anchored (MoDE-P) decoding from a shared dual-stream bitstream, and is reported to outperform various baselines on the Kodak, CLIC2020, and Tecnick datasets.
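To make the two latent types concrete, the toy sketch below contrasts generic scalar quantization (each value rounded independently) with generic vector quantization (each vector replaced by the index of its nearest codebook entry). It is an illustration of the two textbook operations only, not the paper's encoders: the function names, the step size, and the three-entry codebook are all assumptions made for this example.

```python
from math import dist

def scalar_quantize(latent, step=1.0):
    """Round every latent value independently to the nearest multiple of `step`.

    Hypothetical helper: it keeps a continuous-valued grid, which is the kind
    of latent the summary associates with the fidelity-oriented expert.
    """
    return [round(v / step) * step for v in latent]

def vector_quantize(vectors, codebook):
    """Replace each vector by the index of its nearest codebook entry.

    Hypothetical helper: the output is a list of discrete tokens, the kind of
    latent the summary associates with the perception-oriented expert.
    """
    return [
        min(range(len(codebook)), key=lambda k: dist(v, codebook[k]))
        for v in vectors
    ]

# Toy inputs (assumed values, chosen only to show the behaviour).
latent = [0.2, 1.7, -0.6]
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
vectors = [(0.9, 1.2), (-0.8, 0.7), (0.1, -0.2)]

sq_latent = scalar_quantize(latent)            # [0.0, 2.0, -1.0]
vq_tokens = vector_quantize(vectors, codebook)  # [1, 2, 0]
```

The SQ output keeps one value per input value, whereas the VQ output keeps one index per vector. That difference is why the two streams carry different kinds of information.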

Key takeaway

For research scientists working on learned image compression, MoDE offers a way to manage the fidelity–perception trade-off instead of settling for a single compromise. It assigns fidelity and perception to separate decoder experts and coordinates them through ESE and CEM, which the authors report improves image quality across a wide range of bitrates. Consider this dual-latent, decoder-side collaboration approach when a codec must deliver both structural accuracy and perceptual realism under diverse bitrate requirements.

Key insights

Decomposing image compression into fidelity and perception experts via dual-latent decoding balances conflicting reconstruction goals.

Method

MoDE coordinates frozen SQ (fidelity) and VQ (perception) decoders using Expert-Specific Enhancement (ESE) to maintain branch references and Cross-Expert Modulation (CEM) for gated, selective cross-expert information transfer during reconstruction.
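The idea of gated, selective transfer toward an anchor branch can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' CEM module: the summary does not specify the gate's form, so the sigmoid gate, the per-element blend `anchor + g * (other - anchor)`, and the names `cross_expert_modulate` and `decode_anchored` are hypothetical. In a real system the gate logits would come from a learned network and the features from the frozen decoders; here they are plain lists.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real-valued logit to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def cross_expert_modulate(anchor, other, gate_logits):
    """Blend `other` into `anchor` element-wise, controlled by a gate.

    A gate near 0 keeps the anchor feature unchanged; a gate near 1 adopts
    the other expert's feature. (Assumed form, for illustration only.)
    """
    return [
        a + sigmoid(z) * (o - a)
        for a, o, z in zip(anchor, other, gate_logits)
    ]

def decode_anchored(fidelity_feats, perception_feats, gate_logits, anchor="fidelity"):
    """Pick the anchor branch, loosely mirroring the MoDE-F / MoDE-P split."""
    if anchor == "fidelity":
        return cross_expert_modulate(fidelity_feats, perception_feats, gate_logits)
    if anchor == "perception":
        return cross_expert_modulate(perception_feats, fidelity_feats, gate_logits)
    raise ValueError("anchor must be 'fidelity' or 'perception'")

# Toy features (assumed values). A logit of 0 gives a gate of exactly 0.5.
fidelity = [1.0, 2.0]
perception = [3.0, 0.0]
half_blend = decode_anchored(fidelity, perception, [0.0, 0.0])  # [2.0, 1.0]
```

With strongly negative logits the output stays close to the anchor branch, which is the sense in which the transfer is selective: the anchor's content is the default and the other expert contributes only where the gate opens.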

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.