X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

X-Diffusion is a framework for training robot diffusion policies on large-scale human demonstration data alongside smaller robot datasets while handling the embodiment mismatch between the two. Conventional co-training often degrades robot performance because many human actions are kinematically infeasible for the robot. X-Diffusion instead trains a classifier to determine the noise level in the forward diffusion process at which a human action becomes indistinguishable from a robot action. A human action is then used in policy training only once it has been noised to at least that level, so low-level, infeasible human motion is filtered out while high-level task guidance is preserved. In experiments across five manipulation tasks, including "Serve Egg" and "Push Plate," X-Diffusion achieves a 16% higher average success rate than the best baseline, outperforming both naive co-training and policies trained on manually filtered human data.

Key takeaway

For research scientists developing robot imitation learning systems, X-Diffusion offers a way to use abundant human demonstration data without teaching the policy motions the robot cannot execute. Consider a classifier-guided, noise-level-dependent integration strategy for cross-embodiment data: in the reported experiments it improves task success rates by filtering out detrimental low-level human actions while retaining high-level task cues, even when the human dataset is uncurated.

Key insights

X-Diffusion integrates human demonstrations into robot policy training only after their actions have been sufficiently noised by the forward diffusion process, which sidesteps the embodiment mismatch.

Principles

At high noise levels, embodiment-specific action details are washed out and only coarse, task-level structure remains.
Classifier-based filtering keeps the policy from learning motions that are infeasible for the robot.
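
The first principle can be illustrated with a DDPM-style forward process, a_k = sqrt(abar_k) * a_0 + sqrt(1 - abar_k) * eps. The sketch below is only an illustration under stated assumptions: the linear beta schedule, the 1000-step horizon, the helper names, and the two example action vectors are hypothetical and not taken from the paper. It shows that the gap between a human action and a robot action, measured relative to the injected noise, shrinks as the diffusion step grows.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_k) for an assumed linear beta schedule."""
    alpha_bars = []
    prod = 1.0
    for k in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * k / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(action, alpha_bar, rng):
    """Forward diffusion of one action vector at cumulative signal level alpha_bar."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * a + noise * rng.gauss(0.0, 1.0) for a in action]


def separation(human, robot, alpha_bar):
    """Distance between the noised means, in units of the noise standard deviation."""
    gap = math.sqrt(sum((h - r) ** 2 for h, r in zip(human, robot)))
    return math.sqrt(alpha_bar) * gap / math.sqrt(1.0 - alpha_bar)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = alpha_bar_schedule(1000)
    human, robot = [0.5, -0.2], [0.3, 0.1]  # made-up 2-D actions
    for k in (0, 100, 500, 999):
        noised = add_noise(human, alpha_bars[k], rng)
        print(k, round(separation(human, robot, alpha_bars[k]), 4), noised)
```

A large separation means a classifier can easily tell the two embodiments apart; as it approaches zero, the noised human and robot actions come from nearly the same distribution.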

Method

Train a classifier to predict whether a noised action came from a human or a robot. Use a human action in policy training only at noise levels where the classifier can no longer tell it apart from a robot action, so the policy receives high-level guidance without low-level kinematic conflicts.
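
The selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0.6 probability threshold, and the "all later steps must also pass" rule are assumptions made for the example. The input is a list of classifier outputs P(human | noised action, k) for one human action, indexed by diffusion step k.

```python
def min_indistinguishable_step(p_human_per_step, threshold=0.6):
    """Smallest step k such that P(human) <= threshold at k and every later step.

    Returns None if the classifier still recognises the action as human
    at the final (noisiest) step, in which case the action is never used.
    """
    k_min = None
    # Scan from the noisiest step downwards and stop at the first step
    # where the classifier is confident the action is human.
    for k in range(len(p_human_per_step) - 1, -1, -1):
        if p_human_per_step[k] > threshold:
            break
        k_min = k
    return k_min


def use_in_policy_loss(is_human, k, k_min):
    """Decide whether a (sample, diffusion step) pair contributes to the loss.

    Robot samples are used at every step; human samples only at steps
    at or above their minimum indistinguishable step.
    """
    if not is_human:
        return True
    return k_min is not None and k >= k_min


if __name__ == "__main__":
    # Hypothetical classifier outputs for one human action over 5 steps.
    probs = [0.99, 0.90, 0.70, 0.55, 0.50]
    k_min = min_indistinguishable_step(probs)
    for k in range(len(probs)):
        print(k, use_in_policy_loss(True, k, k_min))
```

In a training loop, the boolean returned by `use_in_policy_loss` would act as a per-sample mask on the denoising loss after the diffusion step k has been sampled for each item in the batch.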

In practice

Use Grounded-SAM 2 for object segmentation.
Employ HaMeR for 3D hand-pose estimation, then retarget the estimated hand poses to robot actions.

Topics

Diffusion Policies
Cross-Embodiment Learning
Human-Robot Action Classifier
Kinematic Retargeting
Robot Manipulation

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.