Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Robotics & Autonomous Systems, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A novel framework enables humanoid robots to achieve self-other distinction and learn a kinematics-free 3D self-model solely from proprioceptive-visual correspondence. Evaluated on a 29-DoF Unitree G1 robot in simulated and real-world multi-agent environments, the system robustly identifies its own body with over 99.5% accuracy, significantly outperforming vision-language models like GPT-5.5 and Gemini 3.1 Pro Preview. This distinction then bootstraps a predictive self-model that maps joint configurations to 3D body occupancy, achieving fidelity comparable to oracle models. The learned self-model supports critical downstream tasks, including target reaching with an 88.0% success rate (mean best distance 51.3 mm), collision-aware motion planning with a 71.4% success rate (mean final distance 89.3 mm), and human-to-robot motion retargeting with a mean error of 36.1 mm. This approach requires no prior identity labels or kinematic models, suggesting a path for robots to acquire bodily self-representation through experience.

Key takeaway

Robotics engineers building humanoid systems for shared human environments can implement self-other distinction and kinematics-free 3D self-modeling without hand-built kinematic models or identity labels. The approach relies on proprioceptive-visual correspondence and, in the reported experiments, supports downstream tasks in multi-agent scenes such as target reaching, collision-aware planning, and motion retargeting. Consider evaluating this self-supervised approach wherever a robot's body model needs to be learned from experience rather than specified in advance.

Key insights

Proprioceptive-visual correspondence lets a robot distinguish itself from other agents and learn a 3D self-model without identity labels or a kinematic model.

Principles

Self-other distinction precedes self-modeling.
Temporal proprioceptive-visual co-occurrence is a sufficient self-supervision signal.
Self-model fidelity depends directly on distinction accuracy.
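
The second principle can be illustrated with a toy example. The sketch below is not the paper's attention-guided contrastive method; it swaps in a plain Pearson correlation between proprioceptive and visual motion energy to show why temporal co-occurrence carries identity information. The function names, the synthetic data, and the linear joint-to-image projection are all assumptions made for illustration.

```python
import numpy as np

def motion_energy(signal: np.ndarray) -> np.ndarray:
    """Per-step magnitude of change for a (T, D) time series; returns shape (T-1,)."""
    return np.linalg.norm(np.diff(signal, axis=0), axis=1)

def pick_self(joint_angles: np.ndarray, candidate_tracks: list[np.ndarray]) -> tuple[int, list[float]]:
    """Return the index of the visual track whose motion best co-occurs with proprioception."""
    proprio = motion_energy(joint_angles)
    scores = []
    for track in candidate_tracks:
        visual = motion_energy(track)
        # Pearson correlation between the two motion-energy series.
        scores.append(float(np.corrcoef(proprio, visual)[0, 1]))
    return int(np.argmax(scores)), scores

# Synthetic data (assumption): 4 joints whose step size varies over time.
rng = np.random.default_rng(0)
T = 200
joints = np.cumsum(rng.normal(size=(T, 4)) * rng.uniform(0, 1, size=(T, 1)), axis=0)

# Track 0 is driven by the robot's own joints (hypothetical linear projection plus noise).
own_track = joints @ rng.normal(size=(4, 2)) + 0.01 * rng.normal(size=(T, 2))
# Track 1 is another agent moving independently of the robot's joints.
other_track = np.cumsum(rng.normal(size=(T, 2)), axis=0)

best, scores = pick_self(joints, [own_track, other_track])
print("self track index:", best)
```

A learned contrastive objective replaces this hand-written score in the actual framework, but the supervision signal is the same: only the robot's own body moves in step with its joint readings.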

Method

The framework has two stages. First, attention-guided contrastive learning performs self-other distinction. Then pseudo-ground-truth (pseudo-GT) masks from that stage supervise a part-aware, pose-conditioned implicit 3D occupancy field via bounded volumetric mask rendering.
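
The mask-rendering step in the second stage can be sketched as follows. This is a minimal illustration of volumetric opacity accumulation along a single camera ray, assuming that "bounded" means samples are restricted to a near/far interval; the summary does not spell out the exact formulation. The density function here is a hypothetical sphere standing in for the learned, pose-conditioned field, and all names are illustrative.

```python
import numpy as np

def render_mask_value(density_fn, origin, direction, t_near, t_far, n_samples=64):
    """Accumulate density along one ray between t_near and t_far.

    Returns an opacity in [0, 1) computed as 1 - exp(-sum(sigma_i * delta)).
    """
    direction = direction / np.linalg.norm(direction)
    t = np.linspace(t_near, t_far, n_samples)
    delta = (t_far - t_near) / (n_samples - 1)      # uniform step length
    points = origin + t[:, None] * direction        # (n_samples, 3) sample positions
    sigma = density_fn(points)                      # non-negative density per sample
    return 1.0 - np.exp(-np.sum(sigma * delta))

def sphere_density(points, center=np.array([0.0, 0.0, 2.0]), radius=0.5, sigma_inside=50.0):
    """Hypothetical stand-in for the learned field: constant density inside a sphere."""
    inside = np.linalg.norm(points - center, axis=1) < radius
    return np.where(inside, sigma_inside, 0.0)

origin = np.zeros(3)
# A ray through the sphere should render as body; a ray that misses should render as empty.
hit = render_mask_value(sphere_density, origin, np.array([0.0, 0.0, 1.0]), 1.0, 3.0)
miss = render_mask_value(sphere_density, origin, np.array([1.0, 0.0, 1.0]), 1.0, 3.0)
print(hit, miss)
```

In training, one rendered value per pixel would be compared against the pseudo-GT mask, for example with a binary cross-entropy loss (an assumption here), so gradients flow back into the occupancy field without any 3D labels.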

In practice

Guide robot reaching using density-weighted hand center.
Integrate learned occupancy for collision-aware RRT planning.
Retarget human 3D keypoints to robot joint configurations.
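
The first two items can be sketched with a toy occupancy query. This is a minimal sketch under stated assumptions: `toy_occupancy` is a hypothetical single planar link standing in for the learned self-model, and `edge_in_collision` is one common way to validate an RRT edge by sampling configurations along it, not necessarily the procedure used in the paper.

```python
import numpy as np

def density_weighted_center(points, densities):
    """Weighted mean of sample points, using non-negative densities as weights."""
    w = np.clip(densities, 0.0, None)
    total = w.sum()
    if total <= 0.0:
        raise ValueError("no occupied samples for this part")
    return (w[:, None] * points).sum(axis=0) / total

def toy_occupancy(q, points, length=1.0, thickness=0.1):
    """Hypothetical stand-in for the learned field: one planar link at angle q[0]."""
    tip = length * np.array([np.cos(q[0]), np.sin(q[0])])
    # Project each point onto the link segment, then measure its distance to the segment.
    s = np.clip(points @ tip / (length ** 2), 0.0, 1.0)
    dist = np.linalg.norm(points - s[:, None] * tip, axis=1)
    return (dist < thickness).astype(float)

def edge_in_collision(occupancy_fn, q_from, q_to, obstacle_points, threshold=0.5, n_steps=21):
    """Check a straight joint-space edge by querying body occupancy at obstacle points."""
    for alpha in np.linspace(0.0, 1.0, n_steps):
        q = (1 - alpha) * q_from + alpha * q_to
        if np.any(occupancy_fn(q, obstacle_points) > threshold):
            return True
    return False

# Density-weighted center: the heavier sample pulls the estimate toward itself.
center = density_weighted_center(
    np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]), np.array([1.0, 3.0])
)

# Collision check: an obstacle above the base blocks a half-turn but not a small sweep.
obstacle = np.array([[0.0, 0.8]])
blocked = edge_in_collision(toy_occupancy, np.array([0.0]), np.array([np.pi]), obstacle)
clear = edge_in_collision(toy_occupancy, np.array([0.0]), np.array([np.pi / 4]), obstacle)
print(center, blocked, clear)
```

The step count matters: a coarse sweep can step over a thin obstacle, so the sampling resolution along each edge should be chosen relative to the smallest obstacle and body part of interest.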

Topics

Humanoid Robotics
Self-other Distinction
3D Self-Modeling
Proprioceptive-Visual Learning
Robot Motion Planning
Unitree G1

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.