Shift-and-Sum Quantization for Visual Autoregressive Models
Summary
Shift-and-Sum Quantization is a post-training quantization (PTQ) framework for visual autoregressive (VAR) models. It targets two problems that arise when PTQ is applied to VAR. First, attention-value products suffer large reconstruction errors, particularly at coarse scales where attention scores are high. Second, because calibration data is limited, the frequencies at which codebook entries are sampled diverge from their predicted probabilities. To reduce the reconstruction errors, the framework aggregates quantized results from symmetrically shifted duplicates of value tokens. To close the frequency gap, it resamples the calibration data so that sampling frequencies align with predicted probabilities. Experiments show consistent improvements across various VAR architectures on class-conditional image generation, inpainting, outpainting, and class-conditional editing, establishing a new state of the art for PTQ in VAR.
Key takeaway
For machine learning engineers deploying visual autoregressive models, Shift-and-Sum Quantization is worth evaluating if your current post-training quantization suffers from high reconstruction errors in attention-value products, or from calibration data whose codebook sampling frequencies do not match the predicted probabilities. The method addresses both problems directly, which can help you keep the memory and inference-cost benefits of quantization with less loss in output quality on tasks such as image generation and editing.
Key insights
The framework reduces reconstruction errors in attention-value products and aligns codebook sampling frequencies with predicted probabilities when visual autoregressive models are quantized after training.
Principles
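- Reconstruction errors in attention-value products are a key obstacle in VAR PTQ, especially at coarse scales with high attention scores.
- Calibration data should sample codebook entries at frequencies that match their predicted probabilities.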
- Aggregating symmetrically shifted duplicates reduces quantization error.
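To make the last principle concrete, here is a minimal NumPy sketch of one way that aggregating symmetrically shifted duplicates can shrink quantization error. It is an illustration under stated assumptions, not the paper's algorithm: the uniform 4-bit quantizer, the shift of one quarter of the quantization step, the averaging of two duplicates, and the names `quantize` and `shift_and_sum` are all hypothetical choices made for this example.

```python
import numpy as np

def quantize(x, scale, bits=4):
    """Uniform quantizer (assumed for illustration); returns dequantized floats."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), qmin, qmax) * scale

def shift_and_sum(x, scale, bits=4):
    """Average the quantized results of two duplicates shifted by +/- scale/4.

    The +delta and -delta offsets cancel in the sum, so no correction term
    is needed. The shift size scale/4 is an assumption of this sketch.
    """
    delta = scale / 4.0
    return 0.5 * (quantize(x + delta, scale, bits) + quantize(x - delta, scale, bits))

rng = np.random.default_rng(0)
scale, bits = 0.2, 4

# Toy value tokens (tokens x channels) and softmax attention rows.
values = rng.uniform(-1.0, 1.0, size=(64, 16))
logits = rng.normal(size=(8, 64))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Compare the attention-value product against the full-precision reference.
reference = attn @ values
plain_err = np.abs(attn @ quantize(values, scale, bits) - reference).max()
shifted_err = np.abs(attn @ shift_and_sum(values, scale, bits) - reference).max()
print(f"max |error| plain: {plain_err:.4f}, shift-and-sum: {shifted_err:.4f}")
```

In this toy setup, each duplicate stays on the same low-bit grid, yet their average lands on a grid twice as fine, so the worst-case error per value drops from `scale / 2` to `scale / 4`. Because the attention-value product is linear, the two quantized duplicates could also be multiplied by the attention matrix separately and the products summed afterwards.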
Method
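The framework combines two components. Shift-and-sum quantization aggregates quantized results from symmetrically shifted duplicates of value tokens to reduce reconstruction errors in attention-value products. A calibration data resampling strategy aligns the sampling frequencies of codebook entries with their predicted probabilities.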
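The resampling idea can be illustrated with a small sketch. The summary does not describe the paper's actual procedure, so the following is only one plausible reading: given predicted probabilities over codebook entries and a fixed calibration budget, pick per-entry counts by largest-remainder rounding so that empirical frequencies track the probabilities as closely as the budget allows. The function names `target_counts` and `resample_indices` and the toy codebook size are hypothetical.

```python
import numpy as np

def target_counts(probs, n_samples):
    """Per-entry counts whose frequencies match probs as closely as possible.

    Uses largest-remainder rounding (an assumption of this sketch): start
    from floor(n * p) and hand the leftover draws to the entries with the
    largest fractional parts.
    """
    expected = np.asarray(probs, dtype=np.float64) * n_samples
    counts = np.floor(expected).astype(np.int64)
    remainder = n_samples - int(counts.sum())
    order = np.argsort(-(expected - counts))
    counts[order[:remainder]] += 1
    return counts

def resample_indices(probs, n_samples, rng):
    """Return n_samples codebook indices with frequencies aligned to probs."""
    counts = target_counts(probs, n_samples)
    indices = np.repeat(np.arange(len(counts)), counts)
    rng.shuffle(indices)
    return indices

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(32))   # toy predicted probabilities, 32 entries
n_samples = 100                      # small calibration budget

naive = rng.choice(len(probs), size=n_samples, p=probs)
aligned = resample_indices(probs, n_samples, rng)

def max_gap(indices):
    """Largest gap between empirical frequency and predicted probability."""
    freqs = np.bincount(indices, minlength=len(probs)) / n_samples
    return np.abs(freqs - probs).max()

print(f"max frequency gap, naive: {max_gap(naive):.4f}, aligned: {max_gap(aligned):.4f}")
```

With a small budget, plain random sampling can over- or under-represent individual entries by several draws, whereas the aligned counts in this sketch never differ from `n_samples * probs` by a full draw.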
In practice
- Apply to class-conditional image generation.
- Improve inpainting and outpainting tasks.
- Enhance class-conditional image editing.
Topics
- Post-training Quantization
- Visual Autoregressive Models
- Shift-and-Sum Quantization
- Image Generation
- Inpainting
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.