DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

2026-04-22 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision · Depth: Expert, quick

Summary

DeVI (Dexterous Video Imitation) is a new framework that uses text-conditioned synthetic videos to enable physically plausible dexterous agent control for human-object interaction (HOI) with novel objects. While synthetic videos offer rich interaction knowledge, their 2D nature and limited physical fidelity make direct use challenging for physics-based character control. DeVI addresses this by integrating 3D human tracking with robust 2D object tracking through a hybrid tracking reward, overcoming the imprecision of generative 2D cues. Unlike methods requiring high-quality 3D kinematic demonstrations, DeVI operates solely on generated video, facilitating zero-shot generalization across diverse objects and interaction types. Experiments show DeVI surpasses existing 3D HOI imitation approaches, especially for dexterous hand-object interactions, and is effective in multi-object scenes and for text-driven action diversity.

Key takeaway

For research scientists developing dexterous manipulation systems, DeVI offers a way to learn interactions from text-conditioned synthetic video rather than from captured 3D kinematic demonstrations. Consider its hybrid tracking reward, which pairs 3D human tracking with 2D object tracking, when purely 2D generative cues are too imprecise to drive physics-based control on their own.

Key insights

DeVI enables physics-based dexterous agent control using synthetic 2D videos and a hybrid 2D/3D tracking reward.

Principles

2D video can guide 3D physics-based control.
Hybrid 2D/3D tracking improves physical fidelity.

Method

DeVI uses text-conditioned synthetic videos, integrating 3D human tracking with 2D object tracking via a hybrid reward to achieve physically plausible dexterous agent control.
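
To make the idea of a hybrid reward concrete, here is a minimal sketch in Python. It is not DeVI's actual formulation: this summary does not give the reward terms, so the exponential-of-squared-error form, the pinhole camera model, and every name and constant below (project_points, hybrid_tracking_reward, the weights and scales) are assumptions chosen for illustration. The sketch only shows how a 3D human tracking term and a 2D object tracking term could be combined into one scalar.

```python
import numpy as np


def project_points(points_3d, focal, principal):
    """Pinhole projection of camera-frame 3D points (N, 3) to pixels (N, 2).

    Assumes all points lie in front of the camera (z > 0).
    """
    z = points_3d[:, 2:3]
    return focal * points_3d[:, :2] / z + principal


def hybrid_tracking_reward(sim_joints_3d, ref_joints_3d,
                           sim_obj_points_3d, ref_obj_points_2d,
                           focal, principal,
                           w_human=0.5, w_object=0.5,
                           k_human=10.0, k_object=0.01):
    """Hypothetical hybrid reward: 3D human term plus 2D object term.

    All weights and scales are illustrative assumptions, not values
    taken from the paper.
    """
    # 3D term: mean squared distance between simulated and reference joints.
    human_err = np.mean(np.sum((sim_joints_3d - ref_joints_3d) ** 2, axis=1))

    # 2D term: project simulated object points into the image and compare
    # them with the 2D object tracks extracted from the generated video.
    sim_obj_2d = project_points(sim_obj_points_3d, focal, principal)
    object_err = np.mean(np.sum((sim_obj_2d - ref_obj_points_2d) ** 2, axis=1))

    # Map each error to (0, 1] and blend the two terms.
    r_human = np.exp(-k_human * human_err)
    r_object = np.exp(-k_object * object_err)
    return w_human * r_human + w_object * r_object
```

In this sketch the object term is computed in image space, so the reference needs only 2D object tracks from the video, while the human term uses a 3D reference. With the weights summing to one, perfect tracking on both terms yields a reward of 1.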

In practice

Apply DeVI for zero-shot generalization to novel objects and interaction types.
Use synthetic video as an HOI-aware motion planner.

Topics

DeVI Framework
Dexterous Manipulation
Human-Object Interaction
Synthetic Video Imitation
Physics-based Control

Best for: Research Scientist, AI Scientist, Robotics Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.