Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs
Summary
Mix-and-Match Pruning is a framework for compressing deep neural networks (DNNs) for edge deployment by generating diverse, high-quality pruning configurations. It addresses the limitation of single-strategy pruning methods through globally guided, layer-wise sparsification. The framework operates in three phases: sensitivity analysis that assigns architecture-aware sparsity ranges (e.g., 0% for normalization layers, [0%, 10%] for small layers, [15%, 30%] for Transformer patch embeddings); systematic sampling of these ranges to create ten distinct pruning strategies; and pruning followed by fine-tuning. This process yields multiple Pareto-optimal accuracy-sparsity trade-offs from a single pruning run, eliminating the need for repeated executions. Experiments on CNNs (VGG-11, ResNet-18) and Vision Transformers (LeViT-384, Swin-Tiny) show competitive or superior performance: Mix-and-Match reduces accuracy degradation on Swin-Tiny by 40% relative to standard single-criterion pruning, and shrinks VGG-11 to ~10 MB and ResNet-18 to ~4.5 MB.
Key takeaway
For AI engineers deploying DNNs on memory-constrained edge devices, Mix-and-Match Pruning offers a systematic way to achieve strong compression without extensive trial-and-error. You can generate multiple Pareto-optimal accuracy-sparsity configurations from a single pruning run, which reduces development time and computational cost compared with rerunning a single-strategy method for each target. Consider applying its architecture-aware sparsity ranges to tailor pruning aggressiveness to specific layer types, preserving accuracy while minimizing model size.
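To make "Pareto-optimal accuracy-sparsity configurations" concrete, here is a minimal sketch of how a set of pruned candidates could be filtered down to its Pareto front. The function name `pareto_front`, the configuration labels, and all numbers are hypothetical placeholders chosen for illustration; they are not results or code from the paper.

```python
from typing import List, Tuple

# (name, sparsity, accuracy); higher is better for both values.
Config = Tuple[str, float, float]


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configs that no other config matches or beats on both axes."""
    front = []
    for name, sparsity, accuracy in configs:
        dominated = any(
            (s2 >= sparsity and a2 >= accuracy)
            and (s2 > sparsity or a2 > accuracy)
            for _, s2, a2 in configs
        )
        if not dominated:
            front.append((name, sparsity, accuracy))
    # Sort by sparsity so the trade-off curve reads left to right.
    return sorted(front, key=lambda c: c[1])


# Placeholder numbers for illustration only; not taken from the paper.
candidates = [
    ("A", 0.30, 0.92),
    ("B", 0.50, 0.90),
    ("C", 0.45, 0.88),  # dominated by B: less sparse and less accurate
    ("D", 0.70, 0.85),
]
front = pareto_front(candidates)
```

A deployment team would then pick from `front` the configuration that fits the memory budget of the target device.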
Key insights
Coordinating existing pruning signals through architecture-aware, layer-wise sparsification yields more efficient and reliable DNN compression.
Principles
- Architectural structure dictates compressibility.
- Different layers respond uniquely to pruning.
- Optimal sensitivity criteria vary by architecture.
Method
The framework computes sensitivity scores once, assigns an architecture-aware sparsity range to each layer, and then systematically samples these ranges to generate ten distinct pruning strategies, which are pruned and fine-tuned within a single pruning run.
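The sampling step can be sketched as follows. The summary does not say how the ranges are sampled, so this sketch assumes evenly spaced interpolation between each layer's lower and upper bound; the function name `sample_strategies` and the layer names are hypothetical, while the three ranges are the ones quoted in this summary.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (min sparsity, max sparsity) as fractions


def sample_strategies(
    ranges: Dict[str, Range], num_strategies: int = 10
) -> List[Dict[str, float]]:
    """Return one per-layer sparsity assignment per strategy.

    Assumption: strategy k sits at fraction k / (num_strategies - 1)
    of every layer's range, from least to most aggressive.
    """
    if num_strategies < 2:
        raise ValueError("num_strategies must be at least 2")
    strategies = []
    for k in range(num_strategies):
        t = k / (num_strategies - 1)
        strategies.append(
            {name: lo + t * (hi - lo) for name, (lo, hi) in ranges.items()}
        )
    return strategies


# Hypothetical layer names; ranges as quoted in the summary.
layer_ranges = {
    "norm1": (0.0, 0.0),
    "small_head": (0.0, 0.10),
    "patch_embed": (0.15, 0.30),
}
strategies = sample_strategies(layer_ranges)
```

Each entry in `strategies` maps a layer name to its target sparsity, so the least aggressive strategy prunes the patch embedding at 15% and the most aggressive one at 30%, while normalization layers stay unpruned throughout.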
In practice
- Fix normalization layers at 0% sparsity.
- Limit small layers (<10K parameters) to the [0%, 10%] sparsity range.
- Restrict Transformer patch embeddings to [15%, 30%] sparsity.
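The three rules above can be expressed as a small lookup function. This is a sketch under stated assumptions: the function name `sparsity_range`, the `kind` labels, and the rule precedence (normalization first, then layer size, then patch embedding) are hypothetical, and the range for all remaining layers is not given in this summary, so the caller must supply it as `default`.

```python
from typing import Tuple

Range = Tuple[float, float]  # (min sparsity, max sparsity) as fractions

SMALL_LAYER_PARAMS = 10_000  # "<10K params" threshold from the list above


def sparsity_range(kind: str, num_params: int, default: Range) -> Range:
    """Map a layer to its allowed sparsity range.

    `kind` is a hypothetical label such as "norm", "patch_embed", or "linear".
    `default` covers layers the rules above do not mention (assumption).
    """
    if kind == "norm":
        return (0.0, 0.0)  # normalization layers are never pruned
    if num_params < SMALL_LAYER_PARAMS:
        return (0.0, 0.10)  # small layers are pruned only lightly
    if kind == "patch_embed":
        return (0.15, 0.30)  # Transformer patch embeddings
    return default


# Placeholder default range for illustration; not taken from the paper.
other_layers = (0.20, 0.50)
norm_range = sparsity_range("norm", 768, other_layers)
embed_range = sparsity_range("patch_embed", 50_000, other_layers)
```

Keeping the rules in one function makes it easy to audit which layers are protected before any weights are removed.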
Topics
- Deep Neural Network Pruning
- Layer-wise Sparsification
- Edge AI Deployment
- Vision Transformers
- Model Compression
Best for: AI Engineer, Computer Vision Engineer, AI Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.