DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

DisPOSE is a self-supervised framework for multi-view 3D human pose estimation that targets the generalization limits of methods relying on synthetic 3D pose catalogs. It models the inherently discrete multi-view person-assignment problem as a generative diffusion process over polystochastic tensors, using differentiable Sinkhorn projections to keep the assignments feasible. A Hypergraph-Convolutional Decoder then regresses complete 3D skeletons. DisPOSE achieves leading performance among self-supervised methods, improving AP25 by 19% on CMU Panoptic and reaching 75% mAP on novel camera setups, compared to 59% for the baseline. It is also label-efficient, retaining 99% of its performance with only 10% of the pseudo-labels. The framework performs robustly on the newly proposed MM-OR Pose benchmark, which features highly occluded surgical scenes, and runs 2.8–4.6x faster than SelfPose3D.

Key takeaway

For machine learning engineers building multi-person 3D pose estimation systems where 3D ground truth is scarce, DisPOSE offers a self-supervised alternative. Its combination of projected diffusion and a hypergraph decoder is reported to improve generalization across diverse camera arrangements and challenging scenes such as surgical operating rooms. It is worth evaluating when avoiding synthetic 3D pose catalogs matters for real-world robustness, given its reported leading self-supervised performance and high label efficiency.

Key insights

Self-supervised multi-view 3D human pose estimation can be achieved by modeling discrete person assignment as projected diffusion on polystochastic tensors.
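To make the idea of a Sinkhorn projection concrete, here is a minimal sketch of the simplest case: two views, where a polystochastic tensor reduces to a doubly stochastic matrix (every row and column sums to 1). This is an illustration under stated assumptions, not the paper's implementation. The function name `sinkhorn_project`, the `temperature` parameter, and the example scores are all hypothetical, and the sketch assumes the same number of detections in both views. It uses plain Python so it has no dependencies.

```python
import math


def sinkhorn_project(scores, n_iters=100, temperature=1.0):
    """Map a square score matrix to an approximately doubly stochastic one.

    scores[i][j] is an (assumed) affinity between detection i in view A
    and detection j in view B. Higher means more likely the same person.
    """
    n = len(scores)
    # Subtract the global maximum before exponentiating for stability.
    top = max(max(row) for row in scores)
    P = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: each detection in view A distributes unit mass over view B.
        row_sums = [sum(row) for row in P]
        P = [[P[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Column step: each detection in view B receives unit mass in total.
        col_sums = [sum(P[i][j] for i in range(n)) for j in range(n)]
        P = [[P[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return P


# Hypothetical affinities for three people seen in two views.
scores = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.0, 0.4, 1.8],
]
# A low temperature pushes the soft assignment towards a permutation.
P = sinkhorn_project(scores, temperature=0.1)
matches = [max(range(len(row)), key=row.__getitem__) for row in P]
# matches is [0, 1, 2] for these scores.
```

Every step is a division by a positive sum, so the projection stays differentiable with respect to the scores, which is what lets it sit inside a learned denoising loop. Handling more than two views, unequal detection counts, or missed detections requires the tensor formulation that this summary does not spell out.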

Method

DisPOSE uses a two-stage process: first, projected polystochastic diffusion with Sinkhorn projections solves 2D root association and triangulates 3D roots; second, a hypergraph decoder iteratively refines full-body poses using multi-view and person-part convolutions.
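The triangulation step in the first stage can be illustrated with a standard least-squares ray intersection. This is a sketch of one common technique, not necessarily the one DisPOSE uses, since the summary does not say how roots are triangulated. It assumes each matched 2D root has already been back-projected into a 3D ray (camera centre plus direction); `triangulate_from_rays` and the helper names are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(A, b):
    """Solve the 3x3 system A x = b with Cramer's rule."""
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("rays are parallel or nearly so")
    x = []
    for k in range(3):
        # Replace column k of A with b.
        Ak = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        x.append(det3(Ak) / d)
    return x


def triangulate_from_rays(centers, directions):
    """Return the 3D point minimising summed squared distance to all rays."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        norm = math.sqrt(sum(v * v for v in d))
        u = [v / norm for v in d]
        for i in range(3):
            for j in range(3):
                # I - u u^T removes the component along the ray direction.
                m = (1.0 if i == j else 0.0) - u[i] * u[j]
                A[i][j] += m
                b[i] += m * c[j]
    return solve3(A, b)


# Synthetic check: three cameras looking at a known root position.
root = (0.5, 1.0, 3.0)
centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
directions = [[r - c for r, c in zip(root, cen)] for cen in centers]
estimate = triangulate_from_rays(centers, directions)
```

With noise-free rays the estimate coincides with `root` up to floating-point error. With noisy detections it returns the point closest to all rays in the least-squares sense, which is why a correct cross-view association has to come first.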

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.