DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation
Summary
DisPOSE is a self-supervised framework for multi-view 3D human pose estimation that targets the generalization limits of methods relying on synthetic 3D pose catalogs. It models the inherently discrete multi-view person-assignment problem as a generative diffusion process over polystochastic tensors, using differentiable Sinkhorn projections to keep assignments feasible. A Hypergraph-Convolutional Decoder then regresses complete 3D skeletons. Among self-supervised methods, DisPOSE reports leading performance: it improves AP25 by 19% on CMU Panoptic and reaches 75% mAP on novel camera setups, compared to 59% for the baseline. It is also label-efficient, retaining 99% of its performance with only 10% of the pseudo-labels. The framework remains robust on the newly proposed MM-OR Pose benchmark, which features highly occluded surgical scenes, and runs 2.8–4.6x faster than SelfPose3D.
Key takeaway
For machine learning engineers building multi-person 3D pose estimation systems where 3D ground truth is scarce, DisPOSE offers a self-supervised alternative. Its projected diffusion and hypergraph-based decoding improve generalization across camera arrangements and challenging scenes such as surgical operating rooms. Consider it when strong self-supervised accuracy and label efficiency matter, especially where avoiding synthetic 3D pose catalogs is critical for real-world robustness.
Key insights
Self-supervised multi-view 3D human pose estimation can be achieved by modeling discrete person assignment as projected diffusion on polystochastic tensors.
Principles
- Projected diffusion on polystochastic tensors enables robust multi-view person assignment.
- Disentangling assignment from root regression improves generalization across camera setups.
- Hypergraph convolutions effectively model complex relational structures for pose refinement.
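To make the last principle concrete, here is a minimal NumPy sketch of a normalized hypergraph convolution in the widely used HGNN style. It is an illustration only: the summary does not describe the exact layer in the DisPOSE decoder, and the names `hypergraph_conv`, `incidence`, and `theta` are assumptions chosen for this example.

```python
import numpy as np

def hypergraph_conv(x, incidence, theta, edge_weights=None):
    """One normalized hypergraph convolution (HGNN-style), an illustrative sketch.

    x:         (n_nodes, in_dim) node features, e.g. per-joint features
    incidence: (n_nodes, n_edges) 0/1 matrix; incidence[i, e] = 1 if node i
               belongs to hyperedge e
    theta:     (in_dim, out_dim) weight matrix (learned in a real model)
    """
    incidence = np.asarray(incidence, dtype=float)
    n_edges = incidence.shape[1]
    if edge_weights is None:
        w = np.ones(n_edges)
    else:
        w = np.asarray(edge_weights, dtype=float)

    node_deg = incidence @ w          # weighted count of hyperedges per node
    edge_deg = incidence.sum(axis=0)  # number of nodes per hyperedge
    # Assumes every node belongs to at least one hyperedge (no zero degrees).
    d_inv_sqrt = 1.0 / np.sqrt(node_deg)

    # Propagation operator: D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
    left = d_inv_sqrt[:, None] * incidence * (w / edge_deg)[None, :]
    right = incidence.T * d_inv_sqrt[None, :]
    return left @ right @ x @ theta
```

Unlike an ordinary graph edge, a hyperedge can connect any number of nodes at once, for example all views of one joint or all joints of one body part. With a single hyperedge that contains every node, this layer reduces to averaging the node features.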
Method
DisPOSE uses a two-stage process: first, projected polystochastic diffusion with Sinkhorn projections solves 2D root association and triangulates 3D roots; second, a hypergraph decoder iteratively refines full-body poses using multi-view and person-part convolutions.
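Once 2D roots are associated across views, a 3D root can be recovered by triangulation. The sketch below uses the standard direct linear transform (DLT); the summary does not say which triangulation method DisPOSE uses, so treat this as one plausible choice. The function name `triangulate_dlt` and the toy cameras are assumptions for illustration.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from two or more views with the linear DLT method.

    projections: sequence of 3x4 camera projection matrices
    points_2d:   sequence of (u, v) image coordinates, one per camera
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Toy setup: two normalized cameras, the second shifted along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
root_3d = np.array([0.5, 0.2, 4.0])

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

estimate = triangulate_dlt([P1, P2], [project(P1, root_3d), project(P2, root_3d)])
```

With noise-free detections the estimate matches the original point up to floating-point error. With real detections, the least-squares nature of the SVD solution spreads the error across views.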
In practice
- Train with off-the-shelf 2D pose detections and 3D pseudo-labels for weak supervision.
- Apply multi-marginal Sinkhorn normalization to enforce physical validity in assignment tensors.
- Use geometric and photometric augmentations to enhance model robustness to real-world variations.
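The multi-marginal Sinkhorn step above can be sketched as an iterative rescaling of a score tensor with one axis per camera view. This sketch assumes one common reading of the constraint: for every axis, summing the tensor over all other axes gives 1 for each index, so each detection in each view carries exactly one unit of assignment mass. The paper's exact definition and projection may differ, and `multi_marginal_sinkhorn` is a hypothetical name.

```python
import numpy as np

def multi_marginal_sinkhorn(scores, n_iters=100, eps=1e-12):
    """Rescale a score tensor so that every single-axis marginal sums to 1.

    scores: real-valued tensor with one axis per view, each of the same
            length (same number of candidate detections per view).
    """
    # Exponentiate to get strictly positive entries (shifted for stability).
    t = np.exp(scores - scores.max())
    for _ in range(n_iters):
        for axis in range(t.ndim):
            other_axes = tuple(a for a in range(t.ndim) if a != axis)
            # Marginal of the current axis, kept broadcastable against t.
            marginal = t.sum(axis=other_axes, keepdims=True)
            t = t / (marginal + eps)
    return t

# Example: 3 views, 4 candidate detections per view.
rng = np.random.default_rng(0)
assignment = multi_marginal_sinkhorn(rng.normal(size=(4, 4, 4)))
```

Every operation here is differentiable, which is what allows a projection of this kind to sit inside a learned diffusion step. The result is a soft assignment; a discrete assignment would still require a rounding step that this sketch does not cover.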
Topics
- 3D Human Pose Estimation
- Self-Supervised Learning
- Diffusion Models
- Hypergraph Neural Networks
- Multi-View Systems
- Polystochastic Tensors
- Surgical Workflow Analysis
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.