Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs

2026-03-24 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

Mix-and-Match Pruning is a framework for compressing deep neural networks (DNNs) for deployment on edge devices. Instead of committing to a single pruning strategy, it uses globally guided, layer-wise sparsification to generate a set of diverse, high-quality pruning configurations. The framework has three phases: (1) sensitivity analysis, which assigns each layer an architecture-aware sparsity range (for example, 0% for normalization layers, [0%, 10%] for small layers, and [15%, 30%] for Transformer patch embeddings); (2) systematic sampling of those ranges to produce ten distinct pruning strategies; and (3) pruning followed by fine-tuning. The result is multiple Pareto-optimal accuracy-sparsity trade-offs from a single pruning run, with no need for repeated executions. In experiments on CNNs (VGG-11, ResNet-18) and Vision Transformers (LeViT-384, Swin-Tiny), the method achieves competitive or superior performance: it reduces accuracy degradation on Swin-Tiny by 40% relative to standard single-criterion pruning and shrinks VGG-11 to ~10 MB and ResNet-18 to ~4.5 MB.

Key takeaway

For AI engineers deploying DNNs on memory-constrained edge devices, Mix-and-Match Pruning offers a systematic way to achieve strong compression without extensive trial and error. A single pruning run produces multiple Pareto-optimal accuracy-sparsity configurations, which avoids the development time and compute that repeated single-strategy runs would require. Consider applying its architecture-aware sparsity ranges to match pruning aggressiveness to each layer type, so that sensitive layers are protected while overall model size still shrinks.
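As a rough illustration of what "Pareto-optimal" means here, the sketch below filters a list of (sparsity, accuracy) pairs down to the non-dominated ones. The function name `pareto_front` and the sample numbers are hypothetical and are not results from the paper; the summary does not say how the authors select the front.

```python
from typing import List, Tuple

# (sparsity, accuracy); for both values, higher is better.
Point = Tuple[float, float]

def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only points that no other point beats on both sparsity and accuracy."""
    front: List[Point] = []
    best_acc = float("-inf")
    # Visit candidates from most to least sparse (ties: most accurate first).
    # A point survives only if it is strictly more accurate than every
    # point that is at least as sparse.
    for sparsity, acc in sorted(points, key=lambda p: (-p[0], -p[1])):
        if acc > best_acc:
            front.append((sparsity, acc))
            best_acc = acc
    return sorted(front)

# Illustrative values only, not measurements from the paper.
candidates = [(0.5, 0.90), (0.6, 0.88), (0.6, 0.85), (0.4, 0.89), (0.7, 0.80)]
front = pareto_front(candidates)
```

With these made-up candidates, `(0.4, 0.89)` is dropped because `(0.5, 0.90)` is both sparser and more accurate, and `(0.6, 0.85)` is dropped in favor of `(0.6, 0.88)`.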

Key insights

Coordinating existing pruning signals through architecture-aware, layer-wise sparsification yields more efficient and reliable DNN compression.

Principles

Architectural structure dictates compressibility.
Different layers respond differently to pruning.
Optimal sensitivity criteria vary by architecture.

Method

The framework computes sensitivity scores once, assigns each layer an architecture-aware sparsity range, and systematically samples those ranges to generate ten distinct pruning strategies. Each strategy is then pruned and fine-tuned, so a single run yields the full set of candidate configurations.
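The sampling step can be sketched as follows. The summary does not specify the sampling scheme, so this example assumes the simplest systematic choice: strategy k places every layer at the same relative position within its own range, evenly spaced from the lower to the upper bound. The function name `sample_strategies` and the layer names are hypothetical.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (min_sparsity, max_sparsity) for one layer

def sample_strategies(ranges: Dict[str, Range], n: int = 10) -> List[Dict[str, float]]:
    """Return n per-layer sparsity assignments, evenly spaced across each range.

    Assumption: linear spacing. The paper may use a different sampling scheme.
    """
    if n < 2:
        raise ValueError("n must be at least 2 to span a range")
    strategies = []
    for k in range(n):
        t = k / (n - 1)  # 0.0 for the mildest strategy, 1.0 for the most aggressive
        strategies.append(
            {name: lo + t * (hi - lo) for name, (lo, hi) in ranges.items()}
        )
    return strategies

# Hypothetical layers using the ranges quoted in this summary.
layer_ranges = {
    "norm1": (0.0, 0.0),          # normalization layer: never pruned
    "small_fc": (0.0, 0.10),      # small layer
    "patch_embed": (0.15, 0.30),  # Transformer patch embedding
}
strategies = sample_strategies(layer_ranges)
```

Each entry in `strategies` maps layer names to a target sparsity and could be handed to whatever pruning routine is in use. Because the sensitivity-derived ranges are computed once, generating the ten variants is cheap compared with the fine-tuning that follows.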

In practice

Fix normalization layers at 0% sparsity.
Cap small layers (<10K params) at [0%, 10%] sparsity.
Restrict Transformer patch embeddings to [15%, 30%] sparsity.
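The three rules above can be expressed as a small lookup. This is a minimal sketch, not the authors' implementation: the layer-kind labels, the function name `sparsity_range`, the precedence between rules, and the default range for all other layers are assumptions, since the summary gives ranges only for these three cases.

```python
from typing import Tuple

# Hypothetical layer-kind labels; the summary names the rules, not an API.
NORM_KINDS = {"batchnorm", "layernorm", "groupnorm"}
SMALL_LAYER_PARAMS = 10_000  # threshold from the "<10K params" rule

def sparsity_range(
    kind: str,
    num_params: int,
    default: Tuple[float, float] = (0.30, 0.70),  # assumed; not given in the summary
) -> Tuple[float, float]:
    """Return the (min, max) sparsity allowed for one layer."""
    if kind in NORM_KINDS:
        return (0.0, 0.0)      # normalization layers are never pruned
    if kind == "patch_embed":
        return (0.15, 0.30)    # Transformer patch embeddings
    if num_params < SMALL_LAYER_PARAMS:
        return (0.0, 0.10)     # small layers are pruned only lightly
    return default             # every other layer: placeholder range

norm_range = sparsity_range("layernorm", 768)
small_range = sparsity_range("linear", 5_000)
patch_range = sparsity_range("patch_embed", 150_000)
```

The checks run in a fixed order, so a normalization layer stays at 0% even though it would also qualify as a small layer. That ordering is a design choice made for this sketch.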

Topics

Deep Neural Network Pruning
Layer-wise Sparsification
Edge AI Deployment
Vision Transformers
Model Compression

Best for: AI Engineer, Computer Vision Engineer, AI Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.