Shift-and-Sum Quantization for Visual Autoregressive Models

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

A new post-training quantization (PTQ) framework, "Shift-and-Sum Quantization," targets two problems that arise when PTQ is applied to visual autoregressive (VAR) models. First, attention-value products suffer large reconstruction errors, particularly at coarse scales where attention scores are high. Second, because calibration data is limited, the frequencies at which codebook entries are sampled diverge from their predicted probabilities. To address the first problem, the framework quantizes symmetrically shifted duplicates of the value tokens and aggregates the quantized results, which reduces the reconstruction error. To address the second, it resamples the calibration data so that sampling frequencies align with predicted probabilities. Experiments show consistent improvements across various VAR architectures on class-conditional image generation, inpainting, outpainting, and class-conditional editing, setting a new state of the art for PTQ in VAR.

Key takeaway

For machine learning engineers deploying visual autoregressive models, Shift-and-Sum Quantization offers a way to quantize a trained model with less loss in output quality. If your current post-training quantization pipeline shows large reconstruction errors in attention-value products, or your calibration set is too small to represent the codebook distribution, consider this method. It targets both problems directly, which can make quantized deployment viable for tasks like image generation and editing and may reduce inference cost and memory footprint.

Key insights

In PTQ for visual autoregressive models, shift-and-sum quantization reduces reconstruction errors in attention-value products, and calibration data resampling aligns codebook sampling frequencies with predicted probabilities.

Principles

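Attention-value products suffer large reconstruction errors in VAR PTQ, particularly at coarse scales with high attention scores.
Limited calibration data skews how often codebook entries are sampled, so sampling frequencies should be aligned with predicted probabilities.
Aggregating quantized results from symmetrically shifted duplicates of value tokens reduces reconstruction error.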
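
The last principle can be illustrated with a toy scalar example. This is a minimal sketch, not the paper's implementation: the summary does not specify the quantizer, the shift size, or the aggregation rule, so the uniform round-to-nearest quantizer, the shift of a quarter step, and the averaging used here are all assumptions, and the function names are hypothetical. Under those assumptions, averaging the two quantized copies places values on a grid twice as fine as the original one, which halves the worst-case error.

```python
import random

def quantize(x, step):
    """Uniform round-to-nearest quantizer (assumed): snap x to a multiple of step."""
    return step * round(x / step)

def shift_and_sum(x, step, shift=None):
    """Average the quantized values of two symmetrically shifted copies of x.

    Hypothetical helper: the shift of step / 4 and the averaging rule are
    illustrative assumptions, not details taken from the paper.
    """
    if shift is None:
        shift = step / 4
    # The +shift and -shift offsets cancel in the average, so no bias is added.
    return 0.5 * (quantize(x + shift, step) + quantize(x - shift, step))

def max_abs_error(fn, values):
    """Largest absolute difference between fn(v) and v over the given values."""
    return max(abs(fn(v) - v) for v in values)

random.seed(0)
step = 0.5
values = [random.uniform(-4.0, 4.0) for _ in range(10_000)]

plain_err = max_abs_error(lambda v: quantize(v, step), values)
shifted_err = max_abs_error(lambda v: shift_and_sum(v, step), values)
print(f"plain: {plain_err:.4f}  shift-and-sum: {shifted_err:.4f}")
```

In this toy setting the plain quantizer's error is bounded by half a step and the aggregated version's by a quarter step, at the cost of quantizing each value twice.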

Method

The framework employs shift-and-sum quantization to aggregate results from symmetrically shifted value tokens and a calibration data resampling strategy to align codebook entry frequencies with predicted probabilities.
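
The resampling idea can be sketched as weighted resampling over codebook indices. This is a simplified illustration under stated assumptions: the summary does not describe the actual strategy, real calibration data consists of images or token sequences rather than single indices, and the function name, the toy probabilities, and the weighting rule (predicted probability divided by empirical frequency) are hypothetical choices made for this example.

```python
import random
from collections import Counter

def resample_calibration(tokens, predicted_probs, num_samples, rng):
    """Resample codebook indices so their frequencies follow predicted_probs.

    Hypothetical helper. Each sample is weighted by
    predicted probability / empirical frequency of its codebook entry,
    so over-represented entries are drawn less often and rare ones more often.
    """
    counts = Counter(tokens)
    total = len(tokens)
    weights = [predicted_probs[t] / (counts[t] / total) for t in tokens]
    return rng.choices(tokens, weights=weights, k=num_samples)

# Toy setup (assumed values): three codebook entries with predicted
# probabilities, and a small calibration set that over-represents entry 2.
predicted_probs = [0.5, 0.3, 0.2]
calibration_tokens = [0] * 2 + [1] * 2 + [2] * 6

rng = random.Random(0)
resampled = resample_calibration(calibration_tokens, predicted_probs, 20_000, rng)

freqs = Counter(resampled)
for entry, prob in enumerate(predicted_probs):
    print(entry, prob, round(freqs[entry] / len(resampled), 3))
```

Resampling of this kind can only reweight entries that already appear in the calibration set. An entry that was never sampled keeps a frequency of zero, and the remaining frequencies are renormalized over the entries that were seen.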

In practice

Apply to class-conditional image generation.
Improve inpainting and outpainting tasks.
Enhance class-conditional image editing.

Topics

Post-training Quantization
Visual Autoregressive Models
Shift-and-Sum Quantization
Image Generation
Inpainting
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.