High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

2026-06-14 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

This paper proposes a system for high-fidelity 4D hand-object interaction (HOI) capture that removes a common bottleneck in embodied AI and spatial computing: the reliance on pre-scanned object templates and physical markers. The system reconstructs hands and objects from synchronized multi-view video with neither. It has two components. First, a multi-view feed-forward transformer aggregates cross-view geometry and temporal cues to initialize pose and dense object geometry reliably. Second, a hand-object physics-aware Gaussian-based optimization framework refines those initial estimates using tetrahedral constraints, collision refinement, and appearance decomposition, so that the reconstructions are both physically plausible and visually accurate. The pipeline is validated on public benchmarks and an extensive internal dataset, where the authors report robust, artifact-free results, and it is positioned as an efficient foundation for automated 4D asset generation.

Key takeaway

For computer vision engineers building embodied AI or spatial computing applications, this system offers a way to generate high-fidelity 4D hand-object interaction data without pre-scanned object templates or physical markers, which simplifies data acquisition. It still requires synchronized multi-view video as input. Consider it for automating 4D asset generation, where it could reduce manual scanning effort and improve the realism of interactive simulations.

Key insights

The system enables marker-free, template-free 4D hand-object capture using a transformer for initialization and physics-aware Gaussians for refinement.

Principles

Multi-view geometry and temporal cues improve pose initialization.
Physics-aware optimization enhances reconstruction plausibility.
Combining deep learning with Gaussian-based refinement is effective.
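
To make the first principle concrete, here is a minimal sketch of how two calibrated views constrain a single 3D point, using classical linear (DLT) triangulation. This is not the paper's transformer, which learns to aggregate cross-view cues; it only illustrates the underlying geometry. The camera intrinsics, baseline, and the helper names triangulate_dlt and project are assumptions made for the example.

```python
import numpy as np

def triangulate_dlt(projections, pixels):
    """Triangulate one 3D point from two or more calibrated views (linear DLT).

    projections: list of 3x4 projection matrices (hypothetical toy cameras).
    pixels: list of (u, v) observations of the same point, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point X with a 3x4 projection matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras sharing intrinsics K, separated by a 0.2 m horizontal baseline.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

X_true = np.array([0.05, -0.02, 0.6])   # e.g. a fingertip 0.6 m from camera 1
pixels = [project(P1, X_true), project(P2, X_true)]
X_est = triangulate_dlt([P1, P2], pixels)
```

With noise-free observations the estimate matches the true point up to numerical precision. With real detections, additional views and temporal smoothing reduce the error, which is the intuition behind aggregating both kinds of cue.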

Method

The system uses a multi-view feed-forward transformer for initial pose and dense object geometry estimation, followed by a Gaussian-based optimization framework integrating tetrahedral constraints, collision refinement, and appearance decomposition.
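
The summary names tetrahedral constraints and collision refinement but does not give their formulation. The sketch below shows one generic way such terms are often written: a volume-preservation penalty on tetrahedra and a penetration penalty against an object's signed distance function. Everything here is an assumption for illustration, including the function names, the quadratic penalty form, and the sphere standing in for the object; it should not be read as the authors' loss.

```python
import numpy as np

def tet_volume(v0, v1, v2, v3):
    """Signed volume of a tetrahedron given its four vertices."""
    return np.dot(np.cross(v1 - v0, v2 - v0), v3 - v0) / 6.0

def volume_penalty(verts, tets, rest_volumes):
    """Quadratic penalty on deviation of each tetrahedron from its rest volume."""
    vols = np.array([tet_volume(*verts[t]) for t in tets])
    return float(np.sum((vols - rest_volumes) ** 2))

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere (negative inside). Stand-in for an object SDF."""
    return np.linalg.norm(points - center, axis=1) - radius

def penetration_penalty(hand_points, center, radius):
    """Quadratic penalty on hand points that lie inside the object."""
    d = sphere_sdf(hand_points, center, radius)
    return float(np.sum(np.minimum(d, 0.0) ** 2))

# One unit tetrahedron as a toy volumetric mesh.
rest_verts = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
tets = np.array([[0, 1, 2, 3]])
rest_volumes = np.array([tet_volume(*rest_verts[t]) for t in tets])

# Shrinking the mesh changes its volume, so the penalty becomes positive.
loss_rest = volume_penalty(rest_verts, tets, rest_volumes)
loss_shrunk = volume_penalty(0.5 * rest_verts, tets, rest_volumes)

# One hand point inside a 5 cm sphere, one outside: only the first is penalized.
hand_points = np.array([[0.03, 0.0, 0.0],
                        [0.08, 0.0, 0.0]])
loss_collision = penetration_penalty(hand_points, np.zeros(3), 0.05)
```

In an optimization loop, terms like these would be weighted and added to a rendering loss so that the refined geometry stays volumetrically consistent and the hand does not pass through the object.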

In practice

Generate high-fidelity 4D HOI data for embodied AI.
Create automated 4D assets without manual scanning.
Reconstruct complex interactions from multi-view video.

Topics

4D Hand-Object Interaction
Multi-View Video
Gaussian Splatting
Embodied AI
Spatial Computing
Computer Vision

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.