VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

VideoMDM is a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, with no 3D ground truth required. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences, which are diffused and denoised in 3D. Supervision happens in 2D: the model's prediction is reprojected and compared against precise keypoints. The authors show that a depth-weighted 2D reprojection loss can be equivalent to direct 3D supervision, and they adapt standard 3D motion regularizers such as velocity consistency. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. It reaches an FID of 0.88 on HumanML3D (lower is better), close to the 0.54 of the 3D-supervised MDM, and it generates human-preferred motions on real video datasets such as Fit3D and NBA.
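The equivalence claim can be made concrete with a short worked step. This is an illustrative derivation, not the paper's own: it assumes a pinhole camera with focal length f, a true joint at camera-space coordinates (X, Y, Z), and a predicted joint that shares the same depth Z. All symbols are assumptions introduced here.

```latex
% Pinhole projection of the true joint and the prediction (same depth Z assumed)
u = \frac{fX}{Z}, \quad v = \frac{fY}{Z}, \qquad
\hat{u} = \frac{f\hat{X}}{Z}, \quad \hat{v} = \frac{f\hat{Y}}{Z}

% Scaling the 2D residual by Z/f recovers the 3D residual in the image plane
\frac{Z}{f}\,(\hat{u} - u) = \hat{X} - X, \qquad
\frac{Z}{f}\,(\hat{v} - v) = \hat{Y} - Y

% Hence the depth-weighted squared 2D error equals the squared 3D error in X and Y
\frac{Z^{2}}{f^{2}}\left[(\hat{u} - u)^{2} + (\hat{v} - v)^{2}\right]
  = (\hat{X} - X)^{2} + (\hat{Y} - Y)^{2}
```

Under these assumptions, an unweighted 2D loss would under-penalize errors on joints far from the camera, and the depth weight removes that bias.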

Key takeaway

For ML engineers building 3D human motion generation systems, VideoMDM addresses the scarcity of 3D ground truth data. Consider 2D-supervised diffusion models as a way to exploit abundant monocular video, which reduces the cost and complexity of data collection. The method learns a 3D motion manifold directly from 2D inputs and improves the coherence of generated motion on real-world video datasets such as Fit3D and NBA.

Key insights

VideoMDM trains 3D human motion models from 2D video supervision, reaching an FID of 0.88 on HumanML3D without 3D ground truth.

Principles

Method

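VideoMDM takes approximate 3D poses from a pretrained 2D-to-3D lifter, diffuses and denoises them in 3D, and supervises the result in 2D by reprojecting the prediction and comparing it against accurate keypoints. Standard 3D motion regularizers, such as velocity consistency, are adapted to this setup.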
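The 2D supervision step described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a pinhole camera, camera-space joints with positive depth, and a prediction that shares the target's depth. The function names, the focal length, and the synthetic data are all hypothetical, and the lifter and the denoiser are omitted.

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Pinhole projection of camera-space joints (..., 3) to image coords (..., 2)."""
    xy = points_3d[..., :2]
    z = points_3d[..., 2:3]
    return focal * xy / z

def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=1.0):
    """Squared 2D error with each joint's residual scaled by predicted depth / focal."""
    residual_2d = project(pred_3d, focal) - target_2d
    weight = pred_3d[..., 2:3] / focal
    return np.mean(np.sum((weight * residual_2d) ** 2, axis=-1))

def velocity_loss_2d(pred_3d, target_2d, focal=1.0):
    """Velocity consistency in 2D: compare frame-to-frame differences after projection."""
    pred_vel = np.diff(project(pred_3d, focal), axis=0)
    target_vel = np.diff(target_2d, axis=0)
    return np.mean(np.sum((pred_vel - target_vel) ** 2, axis=-1))

# Synthetic sequence (hypothetical data): 8 frames, 17 joints, depths in [3, 5].
rng = np.random.default_rng(0)
frames, joints = 8, 17
xy = rng.normal(size=(frames, joints, 2))
z = rng.uniform(3.0, 5.0, size=(frames, joints, 1))
gt_3d = np.concatenate([xy, z], axis=-1)
target_2d = project(gt_3d)  # stands in for the accurate 2D keypoints

# A "prediction" that is wrong in x and y but shares the true depth.
pred_3d = gt_3d.copy()
pred_3d[..., :2] += 0.05 * rng.normal(size=(frames, joints, 2))

loss_2d = depth_weighted_reprojection_loss(pred_3d, target_2d)
loss_3d_xy = np.mean(np.sum((pred_3d[..., :2] - gt_3d[..., :2]) ** 2, axis=-1))

print("depth-weighted 2D loss:", loss_2d)
print("3D loss on x and y:    ", loss_3d_xy)
print("2D velocity loss:      ", velocity_loss_2d(pred_3d, target_2d))
```

When the predicted depth matches the true depth, the two printed losses agree up to floating-point error, which illustrates how a depth-weighted 2D loss can act like 3D supervision. When the depths differ, the two quantities diverge, so this sketch should not be read as a general proof.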

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.