CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers
Summary
Qualcomm AI Research introduces CoReDiT, a structured token pruning framework that makes Diffusion Transformers (DiTs) cheaper to run for image and video generation. DiTs produce high-quality samples but are computationally expensive, particularly in self-attention, whose cost scales quadratically with token count. CoReDiT uses a linear-time "spatial coherence" score to identify redundant tokens and skip them in self-attention, reducing self-attention FLOPs by up to 55%. It then reconstructs the skipped attention outputs by coherence-guided aggregation of neighboring retained tokens, which maintains visual continuity and avoids artifacts. A progressive, block-adaptive pruning schedule allocates pruning budgets according to redundancy across transformer blocks and denoising steps. The framework delivers inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs while preserving high visual quality and increasing on-device memory headroom for higher-resolution generation.
Key takeaway
For AI engineers and research scientists working with Diffusion Transformers, CoReDiT offers a practical way to cut inference cost and enable higher-resolution generation on resource-constrained devices. Its spatial coherence-guided pruning and reconstruction are worth evaluating in DiT pipelines, especially for mobile or edge deployments, where the reported gains are a 1.72x inference speedup on mobile NPUs and more on-device memory headroom with visual quality preserved. The method involves lightweight fine-tuning, so plan for a short training pass.
Key insights
CoReDiT efficiently prunes Diffusion Transformers by leveraging spatial coherence for token selection and reconstruction, significantly reducing computational cost.
Principles
- Redundant tokens exhibit strong local spatial coherence.
- Reconstructing skipped tokens preserves spatial continuity.
- Pruning schedules should adapt to block and timestep redundancy.
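The third principle can be illustrated with a small budget-allocation sketch. It is not the schedule from the paper: the function name `block_prune_ratios`, the linear ramp, the proportional split, and the `max_ratio` cap are all assumptions made for illustration. The idea shown is only that blocks with higher measured redundancy receive a larger share of the pruning budget, and that the overall budget ramps up gradually. To adapt across denoising steps as well, the same function could be called once per step with that step's redundancy estimates.

```python
def block_prune_ratios(redundancy, target_ratio, progress, max_ratio=0.9):
    """Split a pruning budget across transformer blocks (illustrative only).

    redundancy   -- one non-negative score per block (hypothetical input,
                    e.g. an average coherence measured on calibration data)
    target_ratio -- average fraction of tokens to skip once fully ramped up
    progress     -- position in the ramp, from 0.0 (start) to 1.0 (end)
    max_ratio    -- assumed per-block cap so no block loses nearly all tokens
    """
    if not redundancy:
        raise ValueError("need at least one block")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")

    # Progressive part: the average budget grows linearly with progress.
    current = target_ratio * progress

    mean_red = sum(redundancy) / len(redundancy)
    if mean_red == 0.0:
        # No redundancy signal at all: fall back to a uniform split.
        return [min(max_ratio, current)] * len(redundancy)

    # Block-adaptive part: scale each block's share by its relative redundancy.
    # Without clipping, the mean of the returned ratios equals `current`.
    return [min(max_ratio, current * r / mean_red) for r in redundancy]


# Three blocks, the last one most redundant; halfway through the ramp.
ratios = block_prune_ratios([0.2, 0.4, 0.6], target_ratio=0.5, progress=0.5)
# ratios is roughly [0.125, 0.25, 0.375]; the average is about 0.25.
```

When the cap is hit, the average falls below the requested budget; a fuller version would redistribute the clipped amount to the other blocks.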
Method
CoReDiT uses a linear-time spatial coherence score for token selection, reconstructs skipped tokens via coherence-guided aggregation, and employs a progressive, block-adaptive pruning schedule during lightweight fine-tuning.
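The scoring, selection, and reconstruction steps can be sketched on a toy token grid. This is a minimal illustration under stated assumptions, not the paper's implementation: the score is taken to be the mean cosine similarity to the 4-connected neighbors, the reconstruction weights are clipped cosine similarities, and the zero fallback for isolated tokens is a guess. All function names (`coherence_scores`, `select_skipped`, `reconstruct`) are hypothetical.

```python
import math

OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # assumed 4-connected neighborhood


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def coherence_scores(tokens):
    """tokens: {(row, col): feature vector}. Returns {(row, col): score}.

    Each token is compared with at most 4 neighbors, so the cost grows
    linearly with the number of tokens.
    """
    scores = {}
    for (r, c), vec in tokens.items():
        sims = [cosine(vec, tokens[(r + dr, c + dc)])
                for dr, dc in OFFSETS if (r + dr, c + dc) in tokens]
        scores[(r, c)] = sum(sims) / len(sims) if sims else 0.0
    return scores


def select_skipped(scores, prune_ratio):
    """Pick the most coherent (treated as most redundant) tokens to skip."""
    k = int(prune_ratio * len(scores))
    # A full sort keeps the sketch short; a linear-time selection also works.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def reconstruct(tokens, retained_out, skipped):
    """Fill skipped positions from retained neighbors, weighted by similarity."""
    out = dict(retained_out)
    for (r, c) in skipped:
        pairs = []
        for dr, dc in OFFSETS:
            nb = (r + dr, c + dc)
            if nb in retained_out:
                weight = max(cosine(tokens[(r, c)], tokens[nb]), 0.0)
                pairs.append((weight, retained_out[nb]))
        total = sum(w for w, _ in pairs)
        dim = len(tokens[(r, c)])
        if total > 0:
            out[(r, c)] = [sum(w * v[i] for w, v in pairs) / total
                           for i in range(dim)]
        else:
            # Assumed fallback: zero attention update for isolated tokens.
            out[(r, c)] = [0.0] * dim
    return out


# Toy 3x3 grid: two identical left columns, a different right column.
tokens = {(r, c): ([1.0, 0.0] if c < 2 else [0.0, 1.0])
          for r in range(3) for c in range(3)}
scores = coherence_scores(tokens)
skipped = select_skipped(scores, prune_ratio=0.4)
# Stand-in for real attention: retained tokens simply get doubled features.
retained_out = {p: [2.0 * x for x in v]
                for p, v in tokens.items() if p not in skipped}
full_out = reconstruct(tokens, retained_out, skipped)
```

In this toy case the left column sits in the flattest region, so it is the part that gets skipped, and each skipped token recovers its output from the similar retained neighbor in the middle column rather than from the dissimilar right column.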
In practice
- Achieves inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs.
- Enables higher-resolution generation by increasing memory headroom.
- Reduces self-attention FLOPs by up to 55%.
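To put the FLOPs figure in context, a back-of-the-envelope calculation follows. It rests on a simplifying assumption that may not match the paper's accounting: the attention cost is taken as proportional to the square of the number of participating tokens, with both queries and keys restricted to retained tokens and projection layers ignored. Both helper functions are hypothetical.

```python
import math


def attention_flops_saved(keep_fraction):
    """Fraction of quadratic attention cost saved when only `keep_fraction`
    of the tokens take part (assumes cost proportional to N squared)."""
    return 1.0 - keep_fraction ** 2


def keep_fraction_for_saving(saving):
    """Inverse of the above: token fraction to keep for a given saving."""
    return math.sqrt(1.0 - saving)


# Under this simple model, a 55% saving corresponds to keeping about
# two thirds of the tokens (sqrt(0.45) is roughly 0.67).
keep = keep_fraction_for_saving(0.55)
```

Under this model, skipping roughly a third of the tokens would already account for a saving of this size, which is why moderate pruning ratios can have a large effect on self-attention cost.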
Topics
- Diffusion Transformers
- Token Pruning
- Spatial Coherence
- Token Reconstruction
- Adaptive Pruning Schedule
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.