StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition) · Depth: Expert, quick

Summary

StableHand is a quality-aware flow-matching framework that recovers the world-space 4D motion of two interacting hands from egocentric video. The setting is difficult because hands frequently leave the camera view and are severely occluded during hand-object interaction. Existing methods condition uniformly on noisy observations, whereas StableHand accounts for how reliable each frame's observation is. A learned quality network predicts observation quality in four channels: wrist global translation and finger articulation, each for the left and right hand. These quality signals enter the flow-matching process in four places: a per-channel forward schedule, a quality-adjusted velocity target, adaptive layer normalization (AdaLN) modulation of the Diffusion Transformer (DiT) denoiser, and a quality-aware ODE initialization. The resulting generative process preserves high-quality observations and reconstructs unreliable ones from a learned bimanual motion prior. StableHand reports state-of-the-art results on the HOT3D and ARCTIC benchmarks, reducing W-MPJPE by 20-25% relative to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.

Key takeaway

For computer vision engineers building robot policy learning systems that rely on egocentric hand motion, StableHand's quality-aware flow matching improves robustness where observations are unreliable. Consider feeding explicit per-frame observation quality signals into your motion estimation pipeline, especially for scenes with frequent occlusions or out-of-view hands, to recover 4D motion more reliably and provide cleaner supervision for robot learning.

Key insights

Accurate dual-hand motion estimation from egocentric video depends on explicitly modeling the quality of each frame's observations rather than treating all observations as equally reliable.

Principles

Method

StableHand uses a learned quality network to predict four-channel quality signals (wrist translation and finger articulation for each hand). These signals modulate the flow-matching framework through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization.
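To make the idea concrete, here is a minimal NumPy sketch of how a per-channel quality score could shape the ODE initialization and the integration schedule. It is an illustration under stated assumptions, not the paper's formulation: the summary does not give the equations, so the linear time mapping `tau = q + (1 - q) * t`, the blend used for initialization, the toy 3-D channel features, and every function name below are hypothetical. The AdaLN conditioning and the learned denoiser are left out; `velocity_fn` stands in for the trained model.

```python
import numpy as np

# Four quality channels named in the summary; the 3-D feature per channel is a toy assumption.
CHANNELS = ("left_wrist", "right_wrist", "left_fingers", "right_fingers")

def per_channel_time(t, q):
    """Assumed schedule: a channel with quality q starts at flow time q.

    q = 1 means the channel is already at the data end (tau = 1 for all t);
    q = 0 recovers the ordinary schedule tau = t.
    """
    return q + (1.0 - q) * t

def quality_aware_init(obs, noise, q):
    """Assumed initialization: blend observation and noise per channel.

    obs, noise: arrays of shape (C, D); q: array of shape (C,) with values in [0, 1].
    """
    w = q[:, None]
    return w * obs + (1.0 - w) * noise

def euler_integrate(x, q, velocity_fn, steps=10):
    """Euler integration of dx/dt = (1 - q) * v(x, tau) over t in [0, 1].

    The (1 - q) factor is d(tau)/dt for the schedule above, so reliable
    channels move little and unreliable channels traverse the full flow.
    """
    dt = 1.0 / steps
    scale = (1.0 - q)[:, None]
    for i in range(steps):
        tau = per_channel_time(i * dt, q)
        x = x + dt * scale * velocity_fn(x, tau)
    return x
```

With an oracle velocity `target - noise` in place of a trained denoiser, the integration returns `q * obs + (1 - q) * target` per channel: a channel with quality 1 keeps its observation unchanged, and a channel with quality 0 is filled in entirely from the generated target. This mirrors the behavior described above, where reliable observations are preserved and unreliable ones are reconstructed from the motion prior.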

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.