DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

DisPOSE is a self-supervised framework for multi-view 3D human pose estimation that targets the poor generalization of methods relying on synthetic 3D pose catalogs. It treats the inherently discrete multi-view person-assignment problem as a generative diffusion process over polystochastic tensors, using differentiable Sinkhorn projections to keep the assignments feasible. A Hypergraph-Convolutional Decoder then regresses complete 3D skeletons. DisPOSE reports leading performance among self-supervised methods: it improves AP25 by 19% on CMU Panoptic and reaches 75% mAP on novel camera setups, compared to 59% for the baseline. It is also label efficient, retaining 99% of its performance with only 10% of the pseudo-labels. The framework remains robust on the newly proposed MM-OR Pose benchmark, which features highly occluded surgical scenes, and runs 2.8–4.6x faster than SelfPose3D.

Key takeaway

For machine learning engineers building multi-person 3D pose estimation systems where 3D ground truth is scarce, DisPOSE is a self-supervised alternative worth evaluating. Its combination of projected diffusion and a hypergraph decoder improves generalization across camera arrangements and in challenging scenes such as surgical operating rooms. It is most relevant when real-world robustness depends on avoiding synthetic 3D pose catalogs and when pseudo-labels are limited.

Key insights

Self-supervised multi-view 3D human pose estimation can be achieved by modeling discrete person assignment as projected diffusion on polystochastic tensors.

Principles

Projected diffusion on polystochastic tensors enables robust multi-view person assignment.
Disentangling assignment from root regression improves generalization across camera setups.
Hypergraph convolutions effectively model complex relational structures for pose refinement.
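
To make the hypergraph principle concrete, here is a minimal sketch of one widely used hypergraph convolution, the normalized operator Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2) X Theta followed by a ReLU. It is an illustration only: the summary does not specify the operator used in the DisPOSE decoder, and the function name `hypergraph_conv`, the toy incidence matrix, and the feature sizes are all assumptions.

```python
import numpy as np

def hypergraph_conv(x, incidence, theta, edge_weights=None):
    """One normalized hypergraph convolution layer (illustrative, not the paper's code).

    x:          (num_nodes, in_dim) node features, e.g. per-joint features
    incidence:  (num_nodes, num_edges) 0/1 matrix, 1 if the node is in the hyperedge
    theta:      (in_dim, out_dim) weight matrix (learned in a real model)
    Assumes every node belongs to at least one hyperedge and no hyperedge is empty.
    """
    num_edges = incidence.shape[1]
    w = np.ones(num_edges) if edge_weights is None else np.asarray(edge_weights, float)
    node_deg = incidence @ w             # weighted number of hyperedges per node
    edge_deg = incidence.sum(axis=0)     # number of nodes per hyperedge
    inv_sqrt = 1.0 / np.sqrt(node_deg)
    # Dv^-1/2 H W De^-1, then H^T Dv^-1/2
    left = inv_sqrt[:, None] * incidence * (w / edge_deg)[None, :]
    propagation = left @ (incidence.T * inv_sqrt[None, :])
    return np.maximum(propagation @ x @ theta, 0.0)  # ReLU

# Toy example: 5 joints, 2 hyperedges (say, two body parts sharing joint 2).
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
features = np.random.default_rng(0).normal(size=(5, 8))
weights = np.random.default_rng(1).normal(size=(8, 4))
out = hypergraph_conv(features, H, weights)  # shape (5, 4)
```

Unlike an ordinary graph edge, a hyperedge can connect any number of nodes, so one hyperedge can group all joints of a body part or all views of the same joint.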

Method

DisPOSE uses a two-stage process. First, projected polystochastic diffusion with Sinkhorn projections solves the cross-view association of 2D root detections, and the matched roots are triangulated into 3D roots. Second, a hypergraph decoder iteratively refines full-body poses using multi-view and person-part convolutions.
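
The triangulation step at the end of the first stage can be sketched with the standard linear (DLT) method, assuming calibrated cameras with known 3x4 projection matrices. Whether DisPOSE uses exactly this solver is not stated in the summary; `triangulate_dlt`, `project`, and the two toy cameras are hypothetical names and values chosen for illustration.

```python
import numpy as np

def project(P, point_3d):
    """Project a 3D point with a 3x4 projection matrix to pixel coordinates."""
    x = P @ np.append(point_3d, 1.0)
    return x[:2] / x[2]

def triangulate_dlt(proj_mats, points_2d):
    """Linear triangulation of one 3D point from two or more views.

    proj_mats: list of (3, 4) projection matrices, one per view
    points_2d: list of (u, v) detections of the same root joint, one per view
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Two toy cameras: one at the origin, one shifted by one unit along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
root = np.array([0.5, 0.2, 4.0])
estimate = triangulate_dlt([P1, P2], [project(P1, root), project(P2, root)])
```

With noise-free detections the estimate matches the original point up to floating-point error. With noisy 2D detections the same code returns a least-squares solution, which is why a correct cross-view association matters before this step.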

In practice

Train with off-the-shelf 2D pose detections and 3D pseudo-labels for weak supervision.
Apply multi-marginal Sinkhorn normalization to enforce physical validity in assignment tensors.
Use geometric and photometric augmentations to enhance model robustness to real-world variations.
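
The Sinkhorn normalization mentioned above can be sketched as iterative rescaling of a positive tensor until every mode's marginal equals one. This is a simplified illustration, not the authors' projection: it assumes the same number of people is detected in every view, unit marginals, and no handling of missed or spurious detections. The function name `multi_marginal_sinkhorn` and its parameters are hypothetical.

```python
import numpy as np

def multi_marginal_sinkhorn(scores, n_iters=100, eps=1e-12):
    """Rescale exp(scores) so that, for each axis, summing over all other
    axes gives 1 for every index (each detection carries total mass 1).

    scores: array of shape (N, N, ..., N), one axis per camera view;
            entry [i, j, k] scores detections i, j, k being the same person.
    """
    t = np.exp(scores - scores.max())  # strictly positive, numerically stable
    for _ in range(n_iters):
        for axis in range(t.ndim):
            other_axes = tuple(a for a in range(t.ndim) if a != axis)
            marginal = t.sum(axis=other_axes, keepdims=True)
            t = t / (marginal + eps)   # normalize this view's marginal to 1
    return t

# Toy example: 3 views with 4 detected people each.
rng = np.random.default_rng(0)
raw_scores = rng.uniform(size=(4, 4, 4))
assignment = multi_marginal_sinkhorn(raw_scores)
view0_marginal = assignment.sum(axis=(1, 2))  # close to a vector of ones
```

Every step is a differentiable elementwise operation, so the same loop can sit inside a network and pass gradients back to the scores. That is the property the summary refers to as differentiable Sinkhorn projections.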

Topics

3D Human Pose Estimation
Self-Supervised Learning
Diffusion Models
Hypergraph Neural Networks
Multi-View Systems
Polystochastic Tensors
Surgical Workflow Analysis

Code references

wngTn/DisPOSE

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.