KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
Summary
KVPO is an online Group Relative Policy Optimization (GRPO) framework for aligning streaming autoregressive (AR) video generators with human preferences. Existing methods rely on noise-based exploration and SDE-based surrogate policies. KVPO instead introduces a causal-semantic exploration paradigm: it routes historical KV cache entries to construct semantically diverse generation branches, so that the branches remain strictly on the data manifold. For policy modeling, KVPO employs a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space. This yields a reward-weighted contrastive objective consistent with the native Ordinary Differential Equation (ODE) formulation. Experiments on LongLive and MemFlow, two distilled AR video generators, show consistent improvements in visual quality, motion quality, and text-video alignment in both single-prompt short-video and multi-prompt long-video settings.
Key takeaway
For research scientists developing or deploying autoregressive video generators, KVPO offers a way to improve alignment with human preferences. It shifts exploration from noise injection to semantic KV cache routing, which is meant to produce diverse generation branches without off-manifold distortions. Consider KVPO's ODE-native policy optimization if you need better visual quality, motion quality, and text-video alignment, especially in long-horizon, multi-prompt scenarios.
Key insights
KVPO aligns AR video generators by exploring semantic variations via KV cache routing and optimizing with an ODE-native velocity-field policy.
Principles
- Relocate the source of variation from noise to the KV cache.
- Quantify branch likelihood in velocity-field space.
- Embed preference optimization into native ODE dynamics.
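The first principle can be pictured with a toy sketch. It assumes, purely for illustration, that each historical cache entry is kept independently with a fixed probability and that the most recent entry is always kept. The function name `sample_routed_history`, the `keep_prob` parameter, and the sampling rule itself are hypothetical and are not taken from the paper.

```python
import random


def sample_routed_history(num_cached, keep_prob, num_branches, seed=0):
    """Pick, for each branch, which cached history entries it may attend to.

    Hypothetical rule (an assumption, not the paper's): every entry except
    the newest is kept independently with probability keep_prob, and the
    newest entry is always kept so each branch retains some context.
    """
    rng = random.Random(seed)
    branches = []
    for _ in range(num_branches):
        kept = [i for i in range(num_cached - 1) if rng.random() < keep_prob]
        kept.append(num_cached - 1)  # always keep the most recent entry
        branches.append(kept)
    return branches


# Four branches over a cache of eight entries; indices are in ascending order.
branches = sample_routed_history(num_cached=8, keep_prob=0.5, num_branches=4)
```

In this picture the branches differ in which history they condition on rather than in injected noise, which is the sense in which the variation source moves to the KV cache.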
Method
KVPO uses Causal History Routing (CHR) to stochastically route historical KV entries for semantic exploration. It then defines a Gibbs-form surrogate policy based on Trajectory Velocity Energy (TVE) for reward-weighted contrastive flow-matching optimization.
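As a rough illustration of the two generic ingredients named above, the sketch below computes a Gibbs distribution over per-branch energies and standard group-relative (GRPO-style) advantages, then combines them in a simple reward-weighted log-likelihood. The energies are placeholder numbers standing in for TVE values, whose exact definition is not given in this summary. The `temperature` parameter and the loss form are assumptions made for this example, not the paper's objective.

```python
import math


def gibbs_policy(energies, temperature=1.0):
    """Gibbs distribution over branches: p_i is proportional to exp(-E_i / T)."""
    logits = [-e / temperature for e in energies]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def reward_weighted_loss(energies, rewards, temperature=1.0):
    """Assumed stand-in objective: advantage-weighted negative log-likelihood."""
    probs = gibbs_policy(energies, temperature)
    advantages = group_relative_advantages(rewards)
    return -sum(a * math.log(p) for a, p in zip(advantages, probs))


# Placeholder energies and rewards for a group of three branches.
probs = gibbs_policy([0.5, 1.0, 2.0])
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Lower-energy branches receive higher probability under the Gibbs form, and minimizing the weighted loss raises the probability of branches whose reward is above the group mean.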
In practice
- Apply LoRA with rank r=256 and scaling factor α=256.
- Train on 32 NVIDIA H200 GPUs, the reported training setup.
- Compute a composite reward from Visual Quality, Motion Quality, and Text-Video Alignment.
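Two of the settings above lend themselves to a small worked example. Under the standard LoRA convention, the low-rank update is scaled by α/r, so r=256 with α=256 gives a scale of 1.0. The summary does not say how the three reward components are combined, so the equal-weight average below is an assumption, and the names `LoraSettings` and `composite_reward` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoraSettings:
    """LoRA hyperparameters listed above."""
    rank: int = 256
    alpha: int = 256

    @property
    def scale(self):
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.alpha / self.rank


def composite_reward(visual, motion, alignment, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three reward components.

    The equal weights are an assumption for illustration only.
    """
    w_visual, w_motion, w_alignment = weights
    return w_visual * visual + w_motion * motion + w_alignment * alignment


settings = LoraSettings()
reward = composite_reward(visual=0.6, motion=0.9, alignment=0.3)
```

With these values `settings.scale` is 1.0, meaning the adapter update is applied without extra up- or down-scaling.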
Topics
- KVPO
- Autoregressive Video Generation
- Group Relative Policy Optimization
- Causal-Semantic Exploration
- KV Cache Routing
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.