RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

The Real-time Autoregressive Video Extrapolation Network (RAVEN) is a training-time test framework for real-time streaming video generation with causal autoregressive video diffusion models. These models extrapolate future video chunks from past content, but the history they see during training differs in distribution from the history they generate for themselves at inference, and this mismatch often degrades long-horizon quality. RAVEN addresses the mismatch by repacking each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states, so that attention during training matches inference-time extrapolation and the history representations themselves can be supervised. The paper also introduces Consistency-model Group Relative Policy Optimization (CM-GRPO), which applies online reinforcement learning directly to a consistency sampling step by treating that step as a conditional Gaussian transition. In the reported experiments, RAVEN outperforms existing causal video distillation baselines on quality, semantic, and dynamic-degree evaluations, and CM-GRPO provides further improvements.

Key takeaway

For research scientists developing real-time video generation systems, RAVEN offers a way to reduce the distribution shift between training and inference histories, which the paper links to better long-horizon video extrapolation. Consider RAVEN's training-time test framework if your causal autoregressive video diffusion model degrades over long rollouts, and CM-GRPO if you want to apply online reinforcement learning on top of a consistency-model sampler.

Key insights

RAVEN and CM-GRPO enhance real-time video extrapolation by aligning training with inference and applying RL to consistency models.

Method

RAVEN repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. CM-GRPO applies online RL to a consistency sampling step by treating that step as a conditional Gaussian transition.
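
The two ideas can be illustrated with a minimal NumPy sketch. It is not the paper's implementation, and every name in it is hypothetical. It assumes a packed order of [clean_0, noisy_0, clean_1, noisy_1, ...] in which each chunk attends only to itself and to earlier clean endpoints, and it assumes the consistency sampling step has the form x_next = f(x_t) + sigma * noise, so its transition density is N(f(x_t), sigma^2 I). The advantage normalization and clipped surrogate follow the standard GRPO recipe.

```python
import numpy as np


def interleaved_attention_mask(num_chunks):
    """Chunk-level mask for the ASSUMED packed order
    [clean_0, noisy_0, clean_1, noisy_1, ...].
    mask[q, k] is True when query chunk q may attend to key chunk k."""
    n = 2 * num_chunks
    mask = np.eye(n, dtype=bool)  # every chunk attends to itself
    for i in range(num_chunks):
        clean_i, noisy_i = 2 * i, 2 * i + 1
        for j in range(i):
            # History is visible only through earlier clean endpoints,
            # mirroring what the model conditions on at inference time.
            mask[clean_i, 2 * j] = True
            mask[noisy_i, 2 * j] = True
    return mask


def gaussian_log_prob(x_next, mean, sigma):
    """Log-density of x_next under N(mean, sigma^2 I).
    ASSUMPTION: mean is the consistency model's output for the step."""
    d = x_next.size
    sq_err = np.sum((x_next - mean) ** 2)
    return -0.5 * sq_err / sigma**2 - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip=0.2):
    """PPO-style clipped objective used by GRPO, returned as a loss to minimize."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


mask = interleaved_attention_mask(2)
adv = group_relative_advantages([1.0, 2.0, 3.0])
```

In this sketch the noisy chunk never attends to its own clean endpoint, since that would leak the denoising target. Whether the paper uses this exact packing order and mask is not stated in the summary above.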

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.