KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

2026-05-15 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, medium

Summary

KVPO is an online Group Relative Policy Optimization (GRPO) framework for aligning streaming autoregressive (AR) video generators with human preferences. Existing methods rely on noise-based exploration and SDE-based surrogate policies. KVPO instead introduces a causal-semantic exploration paradigm: it routes the historical KV cache to construct semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO uses a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space. The result is a reward-weighted contrastive objective consistent with the generator's native Ordinary Differential Equation (ODE) formulation. Experiments on LongLive and MemFlow, two distilled AR video generators, show consistent improvements in visual quality, motion quality, and text-video alignment in both single-prompt short-video and multi-prompt long-video settings.

Key takeaway

For research scientists developing or deploying autoregressive video generators, KVPO offers a way to improve alignment with human preferences while staying within the generator's native ODE formulation. It moves exploration from noise injection to semantic KV cache routing, which yields diverse generation branches without off-manifold distortions. Consider its ODE-native policy optimization when the goal is better visual quality, motion quality, and text-video alignment, especially in long-horizon, multi-prompt scenarios.

Key insights

KVPO aligns AR video generators by exploring semantic variations via KV cache routing and optimizing with an ODE-native velocity-field policy.

Principles

Relocate the source of variation from injected noise to the KV cache.
Quantify branch likelihood in velocity-field space.
Embed preference optimization in the native ODE dynamics.
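
To make the first principle concrete, here is a minimal sketch of what routing historical KV entries per branch might look like. The source does not specify the routing rule, so everything here is an illustrative assumption: the function names `route_history` and `build_branches`, the independent keep probability `keep_prob`, and the choice to always keep the most recent chunk. The only point is that branches differ in which cached history they attend to, not in the noise they start from.

```python
import random


def route_history(num_cached_chunks, keep_prob, rng, always_keep_last=1):
    """Return sorted indices of cached history chunks one branch may attend to.

    Hypothetical routing rule (an assumption, not the paper's): each older
    chunk is kept independently with probability keep_prob, and the most
    recent `always_keep_last` chunks are always kept.
    """
    cutoff = max(num_cached_chunks - always_keep_last, 0)
    kept = [i for i in range(cutoff) if rng.random() < keep_prob]
    kept.extend(range(cutoff, num_cached_chunks))
    return kept


def build_branches(num_cached_chunks, group_size, keep_prob=0.7, seed=0):
    """Sample one routing per branch; variation comes from routing alone."""
    rng = random.Random(seed)
    return [
        route_history(num_cached_chunks, keep_prob, rng)
        for _ in range(group_size)
    ]


# Toy example: 8 cached chunks, a group of 4 branches.
branches = build_branches(num_cached_chunks=8, group_size=4)
for k, kept in enumerate(branches):
    print(f"branch {k}: attends to cached chunks {kept}")
```

In a real generator, each index list would select which cached key/value tensors are visible to attention when the next chunk is generated; that wiring is model-specific and is not shown here.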

Method

KVPO uses Causal History Routing (CHR) to stochastically route historical KV entries for semantic exploration. It then defines a Gibbs-form surrogate policy based on Trajectory Velocity Energy (TVE) for reward-weighted contrastive flow-matching optimization.
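
The second half of the method can be sketched under stated assumptions. The source names a Gibbs-form policy over branches and a reward-weighted contrastive objective but gives no formulas, so the following is a guess at the general shape, not the authors' definition: TVE is assumed to be the mean squared mismatch between predicted and reference velocities along a branch, the policy is a softmax over negative energies with a hypothetical `temperature`, and the loss is an advantage-weighted negative log-likelihood.

```python
import math


def trajectory_velocity_energy(pred_velocities, ref_velocities):
    """Mean squared velocity mismatch along one branch (assumed TVE form)."""
    total, count = 0.0, 0
    for v_pred, v_ref in zip(pred_velocities, ref_velocities):
        for p, r in zip(v_pred, v_ref):
            total += (p - r) ** 2
            count += 1
    return total / count


def gibbs_log_probs(energies, temperature=1.0):
    """log pi_k = -E_k / T - logsumexp_j(-E_j / T), computed stably."""
    logits = [-e / temperature for e in energies]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def contrastive_loss(energies, advantages, temperature=1.0):
    """Advantage-weighted negative log-likelihood over the branch group."""
    log_p = gibbs_log_probs(energies, temperature)
    return -sum(a * lp for a, lp in zip(advantages, log_p)) / len(energies)


# Toy group: three branches, two ODE steps each, 2-D velocities.
ref = [[1.0, 0.0], [0.5, 0.5]]
preds = [
    [[1.0, 0.0], [0.5, 0.5]],  # matches the reference exactly
    [[0.8, 0.1], [0.4, 0.6]],  # small mismatch
    [[0.0, 1.0], [1.0, 0.0]],  # large mismatch
]
energies = [trajectory_velocity_energy(p, ref) for p in preds]
probs = [math.exp(lp) for lp in gibbs_log_probs(energies)]
loss = contrastive_loss(energies, advantages=[1.0, 0.0, -1.0])
print(energies, probs, loss)
```

Under these assumptions, minimizing the loss lowers the energy of branches with positive advantage relative to those with negative advantage, and no SDE noise term appears anywhere in the computation.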

In practice

Apply LoRA with rank r=256 and scaling factor α=256.
Use 32 NVIDIA H200 GPUs for training.
Compute a composite reward from Visual Quality, Motion Quality, and Text-Video Alignment.
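
The settings above can be tied together in a short sketch. The reward weights are not given in the source, so equal weighting is an assumption, as are the function names and the toy scores. The group-relative normalization is the standard GRPO form (reward minus group mean, divided by group standard deviation), and the LoRA scale uses the common alpha / r convention.

```python
import statistics


def composite_reward(visual, motion, alignment, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three reward axes (equal weights are an assumption)."""
    w_v, w_m, w_a = weights
    return w_v * visual + w_m * motion + w_a * alignment


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO normalization: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# LoRA settings from the list above; with the usual alpha / r convention
# the low-rank update is applied at scale 1.0.
LORA_RANK = 256
LORA_ALPHA = 256
lora_scale = LORA_ALPHA / LORA_RANK

# Toy per-branch scores (visual, motion, alignment) for a group of four.
scores = [(0.8, 0.7, 0.9), (0.6, 0.5, 0.7), (0.9, 0.8, 0.8), (0.5, 0.6, 0.6)]
rewards = [composite_reward(*s) for s in scores]
advantages = group_relative_advantages(rewards)
print(rewards, advantages, lora_scale)
```

Because advantages are normalized within each group, only the relative ranking of branches generated from the same context drives the update.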

Topics

KVPO
Autoregressive Video Generation
Group Relative Policy Optimization
Causal-Semantic Exploration
KV Cache Routing

Code references

Richard-Zhang-AI/KVPO

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.