DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

DisPOSE is a self-supervised framework designed for multi-view 3D human pose estimation, specifically addressing the challenge of analyzing interacting behaviors and improving generalization in real-world scenarios. Unlike existing methods that rely on synthetic 3D pose catalogs, DisPOSE approximates the discrete multi-view person-assignment problem as a generative diffusion process over polystochastic tensors. It employs differentiable Sinkhorn projections during denoising to guide solutions toward valid assignments based on 2D image priors. A Hypergraph-Convolutional Decoder then regresses the complete 3D skeletons, explicitly modeling relational structures and articulated joints. The approach outperforms current self-supervised methods on standard datasets and demonstrates strong performance on a new benchmark featuring highly occluded scenes from surgical operating rooms. DisPOSE also exhibits high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels, and is nearly agnostic to different camera arrangements.

Key takeaway

For computer vision engineers developing multi-person 3D pose estimation systems, DisPOSE offers a self-supervised approach aimed at the generalization problems that arise in real-world, occluded environments. Because its diffusion-based person assignment retains 99% of its performance with only 10% of the pseudo-labels and is nearly agnostic to camera arrangement, it can reduce the labeling burden even in complex settings like surgical operating rooms. Consider integrating its principles to improve model robustness and cut annotation effort.

Key insights

DisPOSE uses projected polystochastic diffusion for self-supervised multi-view 3D human pose estimation, improving real-world generalization.

Principles

Method

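DisPOSE approximates the discrete multi-view person-assignment problem as a generative diffusion process over polystochastic tensors. During denoising, differentiable Sinkhorn projections guide samples toward valid assignments based on 2D image priors. A Hypergraph-Convolutional Decoder then regresses the complete 3D skeletons, explicitly modeling relational structures and articulated joints.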
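To make the Sinkhorn projection step more concrete, here is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that "polystochastic" means every axis-aligned line of the tensor sums to 1 (the doubly stochastic condition, generalized to higher order). The function name `sinkhorn_project`, the tensor shape, and the iteration count are illustrative choices that do not come from the source, and the sketch leaves out the diffusion model and the 2D image priors entirely.

```python
import numpy as np


def sinkhorn_project(scores, n_iters=200, eps=1e-9):
    """Push a real-valued score tensor toward a polystochastic one.

    Assumption: "polystochastic" means every axis-aligned line of the
    tensor sums to 1 (for a matrix, this is doubly stochastic).
    """
    # Exponentiate so all entries are strictly positive.
    t = np.exp(scores - scores.max())
    for _ in range(n_iters):
        # Normalise the lines along each axis in turn.
        for axis in range(t.ndim):
            t = t / (t.sum(axis=axis, keepdims=True) + eps)
    return t


# Hypothetical toy input: a noisy order-3 tensor with side length 3.
rng = np.random.default_rng(0)
noisy_scores = rng.normal(size=(3, 3, 3))
projected = sinkhorn_project(noisy_scores)

# Largest deviation of any line sum from 1, per axis.
for axis in range(projected.ndim):
    deviation = np.abs(projected.sum(axis=axis) - 1.0).max()
    print(f"axis {axis}: max line-sum deviation = {deviation:.2e}")
```

Every operation here (exponentiation, summation, division) is smooth, so the same loop written in an automatic-differentiation framework would let gradients flow through the projection. That is the sense in which such a projection can be called differentiable.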

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.