PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

PhyMotion is a structured, fine-grained motion reward for improving the realism of human motion in generated videos. To address the limitations of existing 2D perceptual rewards, it grounds recovered 3D human trajectories in a physics simulator (MuJoCo). It recovers SMPL body meshes from a generated video, retargets them onto a simulated humanoid, and scores the motion along three dimensions of physical feasibility: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each dimension yields a continuous, interpretable signal tied to a specific aspect of motion quality. In the reported experiments, PhyMotion reaches 80% average pairwise agreement with human judgments and a Spearman correlation of ρ=0.376, outperforming existing rewards. Used for RL-based post-training, it consistently improves motion realism for both autoregressive and bidirectional video generators, with a +68 Elo gain in blind human evaluation and an average 7.1% improvement on external evaluators such as VBench.
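The two agreement figures quoted in the summary (pairwise agreement and Spearman correlation) can be illustrated with a small, dependency-free sketch. The function names, the toy scores, and the rule of skipping pairs with tied human ratings are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def pairwise_agreement(reward, human):
    """Fraction of video pairs that reward and humans order the same way.

    Assumption: pairs with tied human ratings carry no preference and are skipped.
    """
    agree = total = 0
    for i, j in combinations(range(len(reward)), 2):
        if human[i] == human[j]:
            continue
        total += 1
        if (reward[i] - reward[j]) * (human[i] - human[j]) > 0:
            agree += 1
    return agree / total


# Toy data (hypothetical): one reward score and one human rating per video.
reward_scores = [0.9, 0.2, 0.5, 0.7]
human_ratings = [4, 1, 5, 2]

print(spearman_rho(reward_scores, human_ratings))
print(pairwise_agreement(reward_scores, human_ratings))
```

Pairwise agreement only asks whether each pair is ordered correctly, while Spearman correlation is sensitive to how far items are displaced in the full ranking, which is one reason a reward can score well on the first and modestly on the second.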

Key takeaway

Research scientists developing human video generation models should consider integrating physics-grounded 3D motion rewards such as PhyMotion into their reinforcement learning post-training pipelines. According to the reported results, this approach aligns better with human perception of motion realism than 2D perceptual metrics alone and provides fine-grained diagnostic signals, leading to more physically plausible and natural human movement in generated videos.

Key insights

Physics-grounded 3D motion rewards improve human motion realism in video generation by scoring the physical feasibility of the motion recovered from each generated video.

Method

PhyMotion recovers SMPL meshes from videos, retargets them to a MuJoCo humanoid, and evaluates kinematic, contact/balance, and dynamic feasibility to generate a structured reward for RL post-training.
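As a rough sketch of how three per-dimension scores might be folded into one scalar for RL post-training, the snippet below takes a weighted mean. Everything here is hypothetical: the `MotionScores` container, the assumption that each score is already normalized to [0, 1], and the weighted-mean aggregation are illustrative choices, not the aggregation PhyMotion actually uses. The upstream steps (SMPL recovery, retargeting, MuJoCo evaluation) are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionScores:
    """Hypothetical per-video scores, each assumed to lie in [0, 1]."""
    kinematic: float        # kinematic plausibility
    contact_balance: float  # contact and balance consistency
    dynamic: float          # dynamic feasibility


def structured_reward(scores, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three dimensions (the weighting is an assumption)."""
    parts = (scores.kinematic, scores.contact_balance, scores.dynamic)
    if any(not 0.0 <= p <= 1.0 for p in parts):
        raise ValueError("each component score must lie in [0, 1]")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)


# Example: a clip with plausible poses but weak dynamic feasibility.
clip = MotionScores(kinematic=0.9, contact_balance=0.6, dynamic=0.3)
print(structured_reward(clip))                           # equal weights
print(structured_reward(clip, weights=(2.0, 1.0, 1.0)))  # emphasize kinematics
```

Keeping the three components separate until the final step preserves the diagnostic value of the reward: a low scalar can be traced back to the dimension that caused it.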

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.