VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

2026-06-11 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

VideoMDM is a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular video, without 3D ground-truth data. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences, which are diffused and denoised in 3D. Supervision happens in 2D: the model's prediction is reprojected and compared against the precise 2D keypoints. The authors show that a depth-weighted 2D reprojection loss can be equivalent to direct 3D supervision, and they adapt standard 3D motion regularizers such as velocity consistency to the 2D setting. Unlike methods that only lift 2D to 3D at inference, VideoMDM learns a coherent 3D motion manifold during training. It reaches an FID of 0.88 on HumanML3D, approaching the 0.54 of the 3D-supervised MDM, and generates human-preferred motions on real video datasets such as Fit3D and NBA.

Key takeaway

For ML engineers building 3D human motion generation systems, VideoMDM offers a way around the scarcity of 3D ground-truth data. Consider 2D-supervised diffusion training to exploit abundant monocular video and reduce reliance on expensive 3D motion capture. The method learns a coherent 3D motion manifold directly from 2D supervision and produces human-preferred motions on real video datasets such as Fit3D and NBA.

Key insights

VideoMDM trains 3D human motion models from 2D video supervision, reaching an FID of 0.88 on HumanML3D without 3D ground truth.

Principles

A depth-weighted 2D reprojection loss can substitute for 3D supervision.
Adapt 3D motion regularizers to 2D settings.
Learning a 3D motion manifold during training is key.
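
One way to see why depth weighting can stand in for 3D supervision is a toy check under a pinhole camera. The sketch below is an illustration, not the paper's formulation: the function names, the focal length of 1.0, and the choice to weight by the predicted depth are all assumptions. When the prediction and the target share the same depth, scaling the 2D residual by depth recovers the lateral 3D error exactly.

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Pinhole projection of (..., 3) camera-space points to (..., 2) image coordinates."""
    return focal * points_3d[..., :2] / points_3d[..., 2:3]

def depth_weighted_reprojection_loss(pred_3d, target_2d, depth, focal=1.0):
    """Mean squared 2D reprojection error with each joint's residual scaled by its depth.

    Hypothetical helper: `depth` has shape (...,) and holds one depth value per joint.
    """
    residual_2d = project(pred_3d, focal) - target_2d
    weighted = residual_2d * depth[..., None] / focal  # undo the 1/z shrinkage of projection
    return np.mean(np.sum(weighted ** 2, axis=-1))

# Toy motion clip: 8 frames, 17 joints, placed 3 to 5 units in front of the camera.
rng = np.random.default_rng(0)
gt_3d = rng.uniform(-1.0, 1.0, size=(8, 17, 3))
gt_3d[..., 2] += 4.0

# Perturb only x and y so prediction and ground truth share the same depth.
pred_3d = gt_3d.copy()
pred_3d[..., :2] += 0.05 * rng.standard_normal((8, 17, 2))

loss_2d = depth_weighted_reprojection_loss(pred_3d, project(gt_3d), depth=pred_3d[..., 2])
loss_3d = np.mean(np.sum((pred_3d - gt_3d) ** 2, axis=-1))
print(np.isclose(loss_2d, loss_3d))
```

Without the depth factor, joints far from the camera would contribute smaller residuals than near joints for the same 3D error. The equality shown here holds only for the shared-depth case constructed in the toy data.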

Method

VideoMDM diffuses approximate 3D poses from a 2D-to-3D lifter, denoises them in 3D, and supervises the result via 2D reprojection against accurate keypoints, with 3D motion regularizers adapted to 2D.
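
The pipeline described above can be sketched as a single loss computation. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with focal length 1.0, a standard DDPM forward step, a denoiser that predicts the clean motion, and a 2D velocity term as one plausible adaptation of velocity consistency. All function names, `vel_weight`, and the two stand-in denoisers are hypothetical.

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Pinhole projection of (..., 3) camera-space points to (..., 2) image coordinates."""
    return focal * points_3d[..., :2] / points_3d[..., 2:3]

def add_noise(x0, alpha_bar_t, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def velocity_2d(keypoints_2d):
    """Frame-to-frame displacement of 2D keypoints, shape (T - 1, J, 2)."""
    return keypoints_2d[1:] - keypoints_2d[:-1]

def training_loss(denoiser, lifted_3d, keypoints_2d, alpha_bar_t, rng, vel_weight=1.0):
    """2D-supervised loss for one motion clip of shape (T, J, 3)."""
    x_t = add_noise(lifted_3d, alpha_bar_t, rng)  # diffuse the lifter's approximate 3D motion
    pred_3d = denoiser(x_t, alpha_bar_t)          # denoise in 3D (assumed to predict clean motion)
    pred_2d = project(pred_3d)                    # reproject the prediction into the image
    position = np.mean(np.sum((pred_2d - keypoints_2d) ** 2, axis=-1))
    velocity = np.mean(
        np.sum((velocity_2d(pred_2d) - velocity_2d(keypoints_2d)) ** 2, axis=-1)
    )
    return position + vel_weight * velocity

# Toy clip: 16 frames, 17 joints, 3 to 5 units in front of the camera.
rng = np.random.default_rng(0)
true_3d = rng.uniform(-1.0, 1.0, size=(16, 17, 3))
true_3d[..., 2] += 4.0
keypoints_2d = project(true_3d)                                 # accurate 2D supervision
lifted_3d = true_3d + 0.1 * rng.standard_normal(true_3d.shape)  # imperfect lifter output

# Stand-in denoisers for illustration only; a real model would be a trained network.
oracle = lambda x_t, a: true_3d   # recovers the true motion
echo = lambda x_t, a: lifted_3d   # merely reproduces the lifter's approximation

loss_oracle = training_loss(oracle, lifted_3d, keypoints_2d, 0.5, rng)
loss_echo = training_loss(echo, lifted_3d, keypoints_2d, 0.5, rng)
print(loss_oracle, loss_echo)
```

The two stand-ins show what the 2D signal rewards: a denoiser that reproduces the lifter's errors is penalized, while one whose output reprojects onto the accurate keypoints is not, even though no 3D ground truth enters the loss.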

In practice

Generate 3D human motion from readily available 2D video.
Reduce reliance on expensive 3D motion capture data.
Apply to sports analytics or character animation.

Topics

3D Human Motion Generation
Diffusion Models
2D-to-3D Lifting
Monocular Video Analysis
Pose Estimation
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.