Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Source: cs.CV updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

MaskAQ is a Masked Attention Alignment approach to Data-Free Quantization (DFQ) of Vision Transformers (ViTs). It targets the performance degradation caused by distribution mismatch in synthetic samples. The method identifies "informative regions" (sparse, semantically critical image patches) and selectively couples them with the quantized model (Q). It has three components: informative region decoupling, which maximizes differential entropy over patch similarity to produce coherent semantic structures; masked attention coupling, which uses an adaptive mask and an alignment objective to bridge synthetic samples and Q; and a periodic sample refreshing strategy that adapts the samples to Q's evolving state. Reported experiments show Top-1 accuracy gains of up to 3.1% on ImageNet for DeiT-T under 3-bit quantization, with consistent improvements across ViT, DeiT, and Swin Transformer backbones on classification, detection, and segmentation tasks.
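To make "differential entropy over patch similarity" more concrete, here is a minimal sketch that scores a set of patch feature vectors by how spread out their pairwise similarities are. The choice of cosine similarity, the Gaussian kernel density estimate, the `bandwidth` value, and all function names are assumptions made for illustration; the summary does not specify the estimator MaskAQ uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def pairwise_similarities(patches):
    """Similarity of every unordered pair of patches."""
    n = len(patches)
    return [
        cosine_similarity(patches[i], patches[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def kde_differential_entropy(samples, bandwidth=0.1):
    """Resubstitution estimate H ~ -(1/n) * sum(log p_hat(x_i)).

    p_hat is a Gaussian kernel density estimate built from the samples.
    The bandwidth is an assumed hyperparameter, not a value from the paper.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    total_log_density = 0.0
    for x in samples:
        density = norm * sum(
            math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in samples
        )
        total_log_density += math.log(density)
    return -total_log_density / n


# Hypothetical patch features: identical patches versus varied patches.
uniform_patches = [[1.0, 0.0]] * 5
varied_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [-1.0, 0.0]]

entropy_uniform = kde_differential_entropy(pairwise_similarities(uniform_patches))
entropy_varied = kde_differential_entropy(pairwise_similarities(varied_patches))
```

In this sketch, identical patches concentrate all similarities at one value and give a low entropy, while varied patches spread the similarities out and give a higher one. A synthesis procedure that maximizes such a score would be pushed away from homogeneous images, which is the intuition the summary points to.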

Key takeaway

Machine Learning Engineers deploying Vision Transformers on edge devices, where data privacy restricts access to the original training data, should consider MaskAQ's approach. Its focus on aligning "informative regions" in synthetic samples improves quantization accuracy, especially at ultra-low bit widths such as 3-bit. This mitigates semantic dispersion and attentional disparity, making it a candidate for obtaining accurate quantized models without real data. Weigh its iterative synthesis overhead against the reported accuracy gains.
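For a sense of how coarse 3-bit quantization is, the sketch below applies a plain min-max uniform quantizer to a list of values. This is a generic illustration, not the quantizer used in the paper; `quantize_uniform` and the example weights are hypothetical.

```python
def quantize_uniform(values, num_bits=3):
    """Round each value to the nearest of 2**num_bits evenly spaced levels.

    Min-max (asymmetric) uniform quantization, chosen here only for
    illustration. Returns the dequantized values and the step size.
    """
    levels = 2 ** num_bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    quantized = []
    for v in values:
        index = round((v - lo) / scale)
        index = max(0, min(levels - 1, index))  # clamp to the valid range
        quantized.append(lo + index * scale)
    return quantized, scale


# Hypothetical weights: 101 evenly spaced values in [-1, 1].
weights = [-1.0 + 0.02 * i for i in range(101)]
quantized_weights, step = quantize_uniform(weights, num_bits=3)
distinct_levels = len(set(quantized_weights))
max_error = max(abs(w - q) for w, q in zip(weights, quantized_weights))
```

With 3 bits there are only 2^3 = 8 representable levels, so every value moves by up to half a step. That rounding error is why accuracy at ultra-low bit widths depends heavily on the quality of the calibration samples.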

Key insights

Data-free quantization of ViTs improves when the informative regions of synthetic samples are aligned with the quantized model's attention.

Principles

Method

MaskAQ decouples informative regions by maximizing differential entropy, then couples them with the evolving quantized model through an adaptive mask and masked attention alignment. The synthetic samples are refreshed periodically.
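The coupling step can be pictured with a small sketch: build a mask that keeps the most attended patches, compare full-precision and quantized attention only on those patches, and regenerate samples on a fixed schedule. The top-k mask rule, the squared-error objective, `keep_ratio`, `period`, and all function names are assumptions for illustration; the summary does not give the paper's exact mask or loss.

```python
def adaptive_mask(attention, keep_ratio=0.25):
    """Binary mask keeping the most attended patches.

    attention: per-patch attention scores (list of floats).
    Ties at the threshold are all kept, so slightly more than
    keep_ratio of the patches may survive.
    """
    k = max(1, int(round(keep_ratio * len(attention))))
    threshold = sorted(attention, reverse=True)[k - 1]
    return [1.0 if a >= threshold else 0.0 for a in attention]


def masked_alignment_loss(attn_fp, attn_q, mask):
    """Mean squared attention gap over the masked (kept) patches only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    gap = sum(m * (a - b) ** 2 for m, a, b in zip(mask, attn_fp, attn_q))
    return gap / kept


def should_refresh(step, period=100):
    """True on steps where the synthetic samples would be regenerated."""
    return step > 0 and step % period == 0


# Hypothetical per-patch attention from a full-precision and a quantized model.
attn_full_precision = [0.5, 0.1, 0.3, 0.1]
attn_quantized = [0.4, 0.3, 0.3, 0.0]

mask = adaptive_mask(attn_full_precision, keep_ratio=0.5)
loss = masked_alignment_loss(attn_full_precision, attn_quantized, mask)
```

Here the mask keeps the first and third patches, so the large disagreement on the second patch does not contribute to the loss. Restricting the objective to the kept patches is what lets the alignment concentrate on informative regions rather than on the whole image.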

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.