KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Source: cs.CV updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, medium

Summary

KVPO is an online Group Relative Policy Optimization (GRPO) framework for aligning streaming autoregressive (AR) video generators with human preferences. Existing methods explore by injecting noise and model the policy with an SDE-based surrogate. KVPO instead introduces a causal-semantic exploration paradigm: it routes historical KV cache entries to construct semantically diverse generation branches, which keeps those branches strictly on the data manifold. For policy modeling, KVPO uses a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space. The result is a reward-weighted contrastive objective consistent with the generator's native Ordinary Differential Equation (ODE) formulation. Experiments on LongLive and MemFlow, two distilled AR video generators, show consistent improvements in visual quality, motion quality, and text-video alignment in both single-prompt short-video and multi-prompt long-video settings.

Key takeaway

For research scientists developing or deploying autoregressive video generators, KVPO offers a way to improve alignment with human preferences while staying within the generator's native ODE formulation. Shifting exploration from noise injection to semantic KV cache routing produces diverse generation branches without off-manifold distortions. Consider KVPO's ODE-native policy optimization when you need better visual quality, motion quality, and text-video alignment, especially in long-horizon, multi-prompt scenarios.

Key insights

KVPO aligns AR video generators by exploring semantic variations via KV cache routing and optimizing with an ODE-native velocity-field policy.

Method

KVPO uses Causal History Routing (CHR) to stochastically route historical KV entries for semantic exploration. It then defines a Gibbs-form surrogate policy based on Trajectory Velocity Energy (TVE) for reward-weighted contrastive flow-matching optimization.
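The two steps above can be illustrated with a minimal, standard-library Python sketch. This summary does not give the exact definitions of CHR or TVE, so everything here is an assumption made for illustration: `route_history` keeps each older KV entry with a fixed probability, `trajectory_velocity_energy` is a summed squared distance between predicted and reference velocities, the Gibbs policy is a softmax over negative energies within a group, and the function names, `keep_prob`, and `temperature` are all hypothetical. It should not be read as the authors' implementation.

```python
import math
import random


def route_history(num_cached, keep_prob, rng, num_recent=1):
    """Hypothetical routing rule: pick which cached KV entries a branch may attend to.

    Older entries are kept independently with probability `keep_prob`;
    the most recent `num_recent` entries are always kept.
    """
    cutoff = max(num_cached - num_recent, 0)
    kept = [i for i in range(cutoff) if rng.random() < keep_prob]
    kept.extend(range(cutoff, num_cached))
    return kept


def trajectory_velocity_energy(pred_velocities, ref_velocities):
    """Assumed energy: squared error between velocity vectors, summed over ODE steps."""
    energy = 0.0
    for v_pred, v_ref in zip(pred_velocities, ref_velocities):
        energy += sum((p - r) ** 2 for p, r in zip(v_pred, v_ref))
    return energy


def gibbs_log_probs(energies, temperature=1.0):
    """Log of a Gibbs distribution over the group: log softmax(-E / T)."""
    logits = [-e / temperature for e in energies]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within the group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def kvpo_style_loss(energies, rewards, temperature=1.0):
    """Reward-weighted contrastive loss: negative advantage-weighted log-probability."""
    log_probs = gibbs_log_probs(energies, temperature)
    advantages = group_relative_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)


# Toy usage: four branches over eight cached history entries.
rng = random.Random(0)
branches = [route_history(num_cached=8, keep_prob=0.5, rng=rng) for _ in range(4)]

# One branch's energy from two ODE steps of 2-D velocities (made-up numbers).
example_energy = trajectory_velocity_energy(
    pred_velocities=[[0.1, 0.2], [0.0, 0.3]],
    ref_velocities=[[0.0, 0.2], [0.0, 0.1]],
)

# Made-up per-branch energies and rewards for the group.
energies = [0.2, 0.9, 0.5, 1.4]
rewards = [0.8, 0.1, 0.6, 0.3]
loss = kvpo_style_loss(energies, rewards)
```

In a real training loop the energies would be differentiable functions of the model parameters, so minimizing this loss would lower the energy of branches with above-average reward and raise it for the rest. The sketch uses plain floats only to make the arithmetic visible.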

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.