VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
Summary
VideoMDM is a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, with no 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences, which are then diffused and denoised in 3D. Supervision happens in 2D: the model's prediction is reprojected and compared against precise keypoints. The authors show that a depth-weighted 2D reprojection loss can be equivalent to direct 3D supervision, and they adapt standard 3D motion regularizers, such as velocity consistency, to the 2D setting. Unlike methods that only lift 2D to 3D at inference, VideoMDM learns a coherent 3D motion manifold during training. It reaches an FID of 0.88 on HumanML3D, approaching the 0.54 of the 3D-supervised MDM, and it generates human-preferred motions on real video datasets such as Fit3D and NBA.
Key takeaway
For ML engineers building 3D human motion generation systems, VideoMDM offers a way around the scarcity of 3D ground truth data. Training a diffusion model under 2D supervision makes abundant monocular video usable as training data, which reduces the cost and complexity of data collection. The model still learns a coherent 3D motion manifold, and it generates human-preferred motions on real video datasets such as Fit3D and NBA.
Key insights
VideoMDM trains 3D human motion models from 2D video supervision, reaching an FID of 0.88 on HumanML3D without 3D ground truth.
Principles
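- A depth-weighted 2D reprojection loss can substitute for direct 3D supervision.
- Standard 3D motion regularizers, such as velocity consistency, can be adapted to 2D supervision.
- Learning a coherent 3D motion manifold during training, rather than lifting 2D to 3D only at inference, is key.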
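The first two principles can be illustrated with a small NumPy sketch. It is not the paper's implementation: the pinhole camera, the focal length `FOCAL`, the use of predicted depth as the weight, and the function names `project`, `depth_weighted_reprojection_loss`, and `velocity_loss_2d` are all assumptions made for illustration. Under a pinhole model a lateral 3D offset `d` at depth `z` shifts the projection by `focal * d / z`, so rescaling the 2D error by `z / focal` recovers the lateral 3D error.

```python
import numpy as np

FOCAL = 1000.0  # hypothetical focal length in pixels


def project(joints_3d, focal=FOCAL):
    """Pinhole projection of camera-space joints (T, J, 3) to pixels (T, J, 2)."""
    return focal * joints_3d[..., :2] / joints_3d[..., 2:3]


def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=FOCAL):
    """Squared 2D error per joint, rescaled by predicted depth / focal length."""
    depth = pred_3d[..., 2:3]
    err_2d = project(pred_3d, focal) - target_2d
    return float(np.mean(np.sum((depth / focal * err_2d) ** 2, axis=-1)))


def velocity_loss_2d(pred_3d, target_2d, focal=FOCAL):
    """Velocity consistency on projected joints: match frame-to-frame 2D displacements."""
    pred_vel = np.diff(project(pred_3d, focal), axis=0)
    target_vel = np.diff(target_2d, axis=0)
    return float(np.mean(np.sum((pred_vel - target_vel) ** 2, axis=-1)))


# Toy data (assumed shapes): 8 frames, 17 joints, depths between 3 and 5 units.
rng = np.random.default_rng(0)
gt_3d = rng.uniform(-1.0, 1.0, size=(8, 17, 3))
gt_3d[..., 2] = rng.uniform(3.0, 5.0, size=(8, 17))
keypoints_2d = project(gt_3d)

# Perturb only x and y, so predicted and true depth agree.
pred_3d = gt_3d.copy()
pred_3d[..., :2] += rng.normal(scale=0.05, size=(8, 17, 2))

loss_2d = depth_weighted_reprojection_loss(pred_3d, keypoints_2d)
loss_3d = float(np.mean(np.sum((pred_3d - gt_3d) ** 2, axis=-1)))
print(loss_2d, loss_3d)
```

In this toy setup the two printed losses agree because depth is held fixed. When the predicted depth is also wrong, the weighted 2D loss only approximates the 3D error, which is one reason additional regularizers such as the velocity term are useful.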
Method
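VideoMDM diffuses approximate 3D pose sequences produced by a pretrained 2D-to-3D lifter and denoises them in 3D. Supervision is applied in 2D by reprojecting the prediction and comparing it against accurate keypoints, with standard 3D motion regularizers adapted to the 2D setting.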
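The flow of one training step might look roughly like the NumPy sketch below. Everything here is an assumption for illustration rather than the paper's code: the pinhole `project` function, the standard Gaussian forward-diffusion formula in `add_noise`, the convention that the denoiser predicts the clean motion, and in particular `placeholder_denoiser`, which stands in for the learned network and performs no learning.

```python
import numpy as np


def project(joints_3d, focal=1000.0):
    """Pinhole projection of camera-space joints (T, J, 3) to pixels (T, J, 2)."""
    return focal * joints_3d[..., :2] / joints_3d[..., 2:3]


def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def placeholder_denoiser(x_t, alpha_bar_t):
    """Hypothetical stand-in for the learned network: undo the signal scaling only."""
    return x_t / np.sqrt(alpha_bar_t)


def training_loss(lifted_3d, keypoints_2d, alpha_bar_t, rng,
                  denoiser=placeholder_denoiser):
    """Diffuse the lifter output in 3D, denoise in 3D, supervise in 2D."""
    x_t = add_noise(lifted_3d, alpha_bar_t, rng)    # noisy 3D motion
    pred_3d = denoiser(x_t, alpha_bar_t)            # predicted clean 3D motion
    err_2d = project(pred_3d) - keypoints_2d        # reprojection error in pixels
    return float(np.mean(np.sum(err_2d ** 2, axis=-1)))


# Toy data (assumed shapes): 8 frames, 17 joints, depths between 3 and 5 units.
rng = np.random.default_rng(0)
true_3d = rng.uniform(-1.0, 1.0, size=(8, 17, 3))
true_3d[..., 2] = rng.uniform(3.0, 5.0, size=(8, 17))
keypoints_2d = project(true_3d)                     # accurate 2D keypoints

# Approximate lifter output: correct x and y, slightly wrong depth.
lifted_3d = true_3d.copy()
lifted_3d[..., 2] += rng.normal(scale=0.1, size=(8, 17))

print(training_loss(lifted_3d, keypoints_2d, alpha_bar_t=0.99, rng=rng))
```

The point of the sketch is the data flow: the approximate 3D sequence is only the input to the diffusion process, while the training signal comes from the accurate 2D keypoints. A real system would replace `placeholder_denoiser` with a trained network and would need to handle depths near zero before dividing in `project`.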
In practice
- Generate 3D human motion from readily available 2D video.
- Reduce reliance on expensive 3D motion capture data.
- Apply to sports analytics or character animation.
Topics
- 3D Human Motion Generation
- Diffusion Models
- 2D-to-3D Lifting
- Monocular Video Analysis
- Pose Estimation
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.