VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation
Summary
Visual Prefix Guidance (VPG) is a training-free, inference-time method for mitigating exposure bias and prefix drift in autoregressive image and video generators. These models are trained on teacher-forced histories, so quality degrades at inference, when they must condition on their own generated prefixes. Prior guidance techniques focus on external semantic conditions such as class labels or text prompts. VPG instead contrasts the model's next-step prediction under the generated prefix with its prediction under a corrupted prefix, then extrapolates the logits toward candidates that strengthen the posterior support of the generated prefix. Without retraining the base model, VPG improves generation quality for class-conditional image generation with VAR (reducing FID by 0.36 on average), text-to-image generation with Infinity, and text-to-video generation with InfinityStar.
Key takeaway
For machine learning engineers and AI scientists working on autoregressive image and video generation, VPG offers a training-free way to counter exposure bias and prefix drift. If you want better generation quality without the computational cost of retraining, its inference-time guidance is worth evaluating. It is reported to improve benchmark results on VAR, Infinity, and InfinityStar, including lower FID on VAR.
Key insights
VPG is a training-free inference method that improves autoregressive generation by strengthening prefix posterior support.
Principles
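- Autoregressive models suffer from exposure bias: they are trained on teacher-forced histories but must sample from their own generated prefixes at inference.
- Inference-time guidance can improve generation quality without retraining the base model.
- Favoring next-step candidates that strengthen the posterior support of the generated prefix counters prefix drift.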
Method
VPG contrasts model output under generated and corrupted prefixes, then extrapolates logits to enhance posterior support for the generated prefix.
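The contrast-and-extrapolate step can be illustrated with a minimal sketch. The linear extrapolation form, the function name `vpg_logits`, and the `scale` parameter are assumptions made for illustration, modeled on how guidance methods commonly combine two sets of logits; the exact rule used by VPG may differ.

```python
import numpy as np


def vpg_logits(logits_generated, logits_corrupted, scale=1.0):
    """Push logits away from the corrupted-prefix prediction.

    Assumed form (not confirmed by the source):
        guided = l_gen + scale * (l_gen - l_corrupt)
    With scale=0 this returns the unguided logits unchanged.
    """
    l_gen = np.asarray(logits_generated, dtype=np.float64)
    l_cor = np.asarray(logits_corrupted, dtype=np.float64)
    return l_gen + scale * (l_gen - l_cor)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Toy logits over three candidates (made-up numbers, not from the paper).
logits_gen = np.array([2.0, 1.0, 0.0])  # model output under the generated prefix
logits_cor = np.array([1.0, 1.0, 1.0])  # model output under a corrupted prefix

guided = vpg_logits(logits_gen, logits_cor, scale=1.0)
probs = softmax(guided)
```

In this toy case the first candidate is the one whose score depends most on the intact prefix, so guidance raises its probability relative to unguided sampling.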
In practice
- Apply VPG to VAR for class-conditional images.
- Use VPG with Infinity for text-to-image tasks.
- Integrate VPG into InfinityStar for text-to-video.
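Because the guidance only touches next-step logits, it can in principle wrap any autoregressive sampler. The sketch below shows one hypothetical way to do so with a toy token-level model. The names `sample_with_prefix_guidance`, `toy_next_logits`, and `toy_corrupt`, the random-replacement corruption, and the linear extrapolation rule are all assumptions for illustration; none of them is the actual interface of VAR, Infinity, or InfinityStar.

```python
import numpy as np

VOCAB = 8  # toy vocabulary size (assumption)


def sample_with_prefix_guidance(next_logits, corrupt, prompt, steps, scale=1.0, seed=0):
    """Sample `steps` tokens, guiding each step with a corrupted-prefix contrast."""
    rng = np.random.default_rng(seed)
    prefix = list(prompt)
    for _ in range(steps):
        # Two forward passes per step: generated prefix and corrupted prefix.
        clean = np.asarray(next_logits(prefix), dtype=np.float64)
        noisy = np.asarray(next_logits(corrupt(prefix, rng)), dtype=np.float64)
        # Assumed extrapolation rule; scale=0 gives plain sampling.
        guided = clean + scale * (clean - noisy)
        guided -= guided.max()  # stabilise the softmax
        probs = np.exp(guided)
        probs /= probs.sum()
        prefix.append(int(rng.choice(len(probs), p=probs)))
    return prefix


def toy_next_logits(prefix):
    """Toy stand-in for a real model: favours the token after the last one."""
    logits = np.zeros(VOCAB)
    logits[(prefix[-1] + 1) % VOCAB] = 3.0
    return logits


def toy_corrupt(prefix, rng):
    """Hypothetical corruption: replace each token at random with probability 0.5."""
    out = list(prefix)
    for i in range(len(out)):
        if rng.random() < 0.5:
            out[i] = int(rng.integers(VOCAB))
    return out


tokens = sample_with_prefix_guidance(
    toy_next_logits, toy_corrupt, prompt=[0], steps=5, scale=1.5
)
```

In this sketch the base model is only called, never updated, which mirrors the training-free claim; the cost is a second forward pass per step for the corrupted prefix.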
Topics
- Autoregressive Models
- Image Generation
- Video Generation
- Exposure Bias
- Inference Guidance
- VAR
- InfinityStar
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.