VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

Visual Prefix Guidance (VPG) is a training-free, inference-time method for mitigating exposure bias and prefix drift in autoregressive image and video generators. These models are trained with teacher-forced histories but must condition on their own sampled prefixes at inference, and generation quality degrades as a result. Prior guidance techniques steer toward external semantic conditions such as class labels or text prompts. VPG instead contrasts the model's next-step prediction under the generated prefix with its prediction under a corrupted prefix, then extrapolates the logits toward candidates that strengthen the posterior support of the generated prefix. Without retraining the base model, VPG reduces FID on VAR by 0.36 on average and improves class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar.

Key takeaway

For machine learning engineers and AI scientists working on autoregressive image and video generation models, VPG is a training-free way to counter exposure bias and prefix drift. If you want better generation quality without the computational cost of retraining, evaluate VPG's inference-time guidance: the reported results show lower FID and improved benchmark performance on VAR, Infinity, and InfinityStar.

Key insights

VPG is a training-free inference method that improves autoregressive generation by strengthening prefix posterior support.

Method

VPG contrasts model output under generated and corrupted prefixes, then extrapolates logits to enhance posterior support for the generated prefix.
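
The contrast-and-extrapolate step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not give the exact update rule or the corruption scheme, so the sketch assumes a classifier-free-guidance-style extrapolation in log-probability space. The names prefix_guided_logits, corrupt_prefix, scale, and rate are hypothetical, and random token replacement is only a placeholder for whatever corruption the paper actually uses.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def corrupt_prefix(prefix_tokens, vocab_size, rate, rng):
    """ASSUMPTION: placeholder corruption that replaces a random fraction
    of prefix tokens with uniformly drawn token ids. The source does not
    specify how the corrupted prefix is built."""
    prefix_tokens = np.asarray(prefix_tokens)
    replace = rng.random(prefix_tokens.shape) < rate
    random_tokens = rng.integers(0, vocab_size, size=prefix_tokens.shape)
    return np.where(replace, random_tokens, prefix_tokens)


def prefix_guided_logits(logits_generated, logits_corrupted, scale):
    """ASSUMPTION: CFG-style extrapolation, used here for illustration.

    logits_generated: next-step logits given the generated prefix.
    logits_corrupted: next-step logits given the corrupted prefix.
    scale: guidance strength; 0 recovers the unguided prediction.
    """
    lp_gen = log_softmax(np.asarray(logits_generated, dtype=np.float64))
    lp_cor = log_softmax(np.asarray(logits_corrupted, dtype=np.float64))
    # Push toward candidates that are more likely under the intact prefix
    # than under the corrupted one.
    return lp_gen + scale * (lp_gen - lp_cor)


# Toy usage with hand-written logits standing in for two model calls.
rng = np.random.default_rng(0)
corrupted = corrupt_prefix(np.arange(8), vocab_size=16, rate=0.5, rng=rng)
guided = prefix_guided_logits([2.0, 1.0, 0.0], [1.0, 1.0, 1.0], scale=1.0)
probs = np.exp(log_softmax(guided))  # renormalized sampling distribution
```

Under this assumed form, the difference of log-probabilities equals, by Bayes' rule and up to a constant, the log-ratio of how well each candidate explains the intact prefix versus the corrupted one, which is one way to read "posterior support of the generated prefix". As described, the method needs the model's output under both prefixes at each guided step.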

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.