CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Internet of Things (IoT) & Connected Devices · Depth: Expert, extended

Summary

Qualcomm AI Research introduces CoReDiT, a structured token-pruning framework that makes Diffusion Transformers (DiTs) cheaper to run for image and video generation. DiTs produce high-quality samples but are computationally expensive, chiefly because self-attention cost grows quadratically with the number of tokens. CoReDiT computes a linear-time "spatial coherence" score to identify redundant tokens and skips them in self-attention, cutting FLOPs by up to 55%. The attention outputs of skipped tokens are reconstructed by coherence-guided aggregation over neighboring retained tokens, which maintains visual continuity and avoids artifacts. A progressive, block-adaptive pruning schedule allocates the pruning budget according to the redundancy observed across transformer blocks and denoising steps. The framework delivers inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs while preserving high visual quality, and it frees on-device memory for higher-resolution generation.
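To make the quadratic scaling concrete, here is a back-of-the-envelope sketch in Python. It counts only the two N x N matrix products in self-attention (QK^T and AV) and ignores softmax and the linear projections; the function names and the example sizes are illustrative assumptions, not values from the paper, and the result is not meant to reproduce the reported 55% figure.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate FLOPs of the two N x N matmuls in self-attention (QK^T and AV).

    Each matmul costs about 2 * N * N * dim FLOPs (one multiply and one add
    per term). Softmax and the Q/K/V/output projections are ignored.
    """
    return 4 * num_tokens * num_tokens * dim


def attention_savings(num_tokens: int, dim: int, keep_ratio: float) -> float:
    """Fraction of attention-matmul FLOPs saved when only keep_ratio of tokens attend."""
    kept = int(round(num_tokens * keep_ratio))
    return 1.0 - attention_flops(kept, dim) / attention_flops(num_tokens, dim)


if __name__ == "__main__":
    # Hypothetical sizes: a 64 x 64 latent grid (4096 tokens), hidden width 1152.
    for keep in (1.0, 0.75, 0.5):
        saved = attention_savings(4096, 1152, keep)
        print(f"keep {keep:.2f} of tokens -> saves {saved:.2%} of attention matmul FLOPs")
```

Because the cost is quadratic, keeping a fraction r of the tokens leaves roughly r^2 of the attention-matmul work, so skipping tokens pays off faster than linearly. The overall saving for a full model is smaller, since projections and feed-forward layers scale only linearly with token count.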

Key takeaway

For AI engineers and research scientists working with Diffusion Transformers, CoReDiT offers a practical way to cut inference cost and to generate at higher resolution on resource-constrained devices. Consider integrating its spatial coherence-guided pruning and reconstruction into your DiT pipeline, especially for mobile or edge deployments, where the reported gains (up to 55% fewer FLOPs and a 1.72x speedup on mobile NPUs) come while preserving high visual quality.

Key insights

CoReDiT uses spatial coherence twice: first to decide which tokens to skip in self-attention, then to reconstruct the skipped outputs from neighboring retained tokens. Together these reduce FLOPs by up to 55%.
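The select-then-reconstruct idea can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the coherence score is assumed to be the mean cosine similarity of a token to its four grid neighbors (which is linear in the number of tokens), the most coherent tokens are treated as redundant, and a skipped output is rebuilt as a similarity-weighted mean of its retained neighbors' attention outputs. All function names, the 4-neighborhood, the exponential weighting, and the global-mean fallback are hypothetical choices made for this example.

```python
import numpy as np

# 4-connected neighborhood: up, down, left, right.
OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def neighbor_view(padded, dy, dx, h, w):
    """Slice a 1-padded grid so that entry [y, x] is the neighbor at (y+dy, x+dx)."""
    return padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]


def neighbor_similarities(x):
    """Cosine similarity of every token to its 4 grid neighbors, in O(H*W*D).

    x: (H, W, D) token grid. Returns sims and valid, both (4, H, W); valid is
    False where the neighbor would fall outside the grid.
    """
    h, w, _ = x.shape
    unit = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    unit_pad = np.pad(unit, ((1, 1), (1, 1), (0, 0)))
    inside_pad = np.pad(np.ones((h, w)), 1) > 0
    sims = np.stack([(unit * neighbor_view(unit_pad, dy, dx, h, w)).sum(-1)
                     for dy, dx in OFFSETS])
    valid = np.stack([neighbor_view(inside_pad, dy, dx, h, w) for dy, dx in OFFSETS])
    return sims, valid


def self_attention(tokens):
    """Plain single-head softmax attention over (N, D) tokens, no learned projections."""
    logits = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    return (weights / weights.sum(-1, keepdims=True)) @ tokens


def pruned_attention(x, prune_ratio):
    """Skip the most coherent tokens in attention, then rebuild their outputs.

    Assumes 0 <= prune_ratio < 1 and a grid of at least 2 x 2 tokens.
    """
    h, w, _ = x.shape
    sims, valid = neighbor_similarities(x)
    score = (sims * valid).sum(0) / valid.sum(0)  # mean similarity to in-grid neighbors

    # Mark the k highest-scoring (most redundant) tokens as pruned.
    k = int(h * w * prune_ratio)
    pruned = np.zeros(h * w, dtype=bool)
    if k > 0:
        pruned[np.argsort(score.ravel())[-k:]] = True
    pruned = pruned.reshape(h, w)

    out = np.zeros_like(x, dtype=float)
    out[~pruned] = self_attention(x[~pruned])  # attention runs on retained tokens only

    # Rebuild each pruned output as a similarity-weighted mean of retained neighbors.
    kept_pad = np.pad((~pruned).astype(float), 1)  # padding counts as "not retained"
    out_pad = np.pad(out, ((1, 1), (1, 1), (0, 0)))
    num, den = np.zeros_like(out), np.zeros((h, w))
    for i, (dy, dx) in enumerate(OFFSETS):
        wgt = np.exp(sims[i]) * neighbor_view(kept_pad, dy, dx, h, w)
        num += wgt[..., None] * neighbor_view(out_pad, dy, dx, h, w)
        den += wgt
    fallback = out[~pruned].mean(0)  # used when no neighbor of a pruned token was retained
    recon = np.where(den[..., None] > 0, num / np.maximum(den, 1e-8)[..., None], fallback)
    out[pruned] = recon[pruned]
    return out, pruned
```

Calling `pruned_attention(grid, prune_ratio=0.4)` on an `(H, W, D)` array returns the full-size output plus the boolean mask of skipped positions. Both the scoring and the reconstruction touch each token a constant number of times, so the only quadratic step left is attention over the retained tokens.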

Principles

Method

CoReDiT uses a linear-time spatial coherence score for token selection, reconstructs skipped tokens via coherence-guided aggregation, and employs a progressive, block-adaptive pruning schedule during lightweight fine-tuning.
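One way to picture a progressive, block-adaptive schedule is sketched below in Python. The summary does not spell out the schedule's exact form, so everything here is an assumption for illustration: per-block redundancy is taken as a given list of non-negative numbers (for example, each block's average coherence score), budgets are split in proportion to it, and "progressive" is modeled as a linear ramp of the prune ratio over fine-tuning steps. The names `block_budgets`, `progressive_ratio`, `schedule`, and the `max_ratio` cap are hypothetical.

```python
def block_budgets(redundancy, mean_ratio, max_ratio=0.9):
    """Split an average prune ratio across blocks in proportion to their redundancy.

    Blocks with more redundancy receive a larger share. Values are capped at
    max_ratio, so the realized mean can fall below mean_ratio when the cap binds.
    """
    total = sum(redundancy)
    count = len(redundancy)
    return [min(max_ratio, mean_ratio * count * r / total) for r in redundancy]


def progressive_ratio(step, warmup_steps, target_ratio):
    """Linearly ramp a prune ratio from 0 to target_ratio over warmup_steps."""
    return target_ratio * min(1.0, step / warmup_steps)


def schedule(step, warmup_steps, redundancy, mean_ratio):
    """Per-block prune ratios at a given fine-tuning step."""
    return [progressive_ratio(step, warmup_steps, budget)
            for budget in block_budgets(redundancy, mean_ratio)]


if __name__ == "__main__":
    # Hypothetical redundancy estimates for a 4-block model.
    redundancy = [0.2, 0.4, 0.8, 0.6]
    for step in (0, 500, 1000):
        ratios = schedule(step, warmup_steps=1000, redundancy=redundancy, mean_ratio=0.4)
        print(step, [round(r, 3) for r in ratios])
```

Ramping the ratio gradually gives the model time to adapt during fine-tuning instead of facing the full pruning rate at once; the same proportional split could be recomputed per denoising step if redundancy estimates are available at that granularity.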

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.