Proprioceptive-visual correspondence enables self-other distinction in humanoid robots
Summary
A novel framework enables humanoid robots to achieve self-other distinction and learn a kinematics-free 3D self-model solely from proprioceptive-visual correspondence. Evaluated on a 29-DoF Unitree G1 robot in simulated and real-world multi-agent environments, the system robustly identifies its own body with over 99.5% accuracy, significantly outperforming vision-language models like GPT-5.5 and Gemini 3.1 Pro Preview. This distinction then bootstraps a predictive self-model that maps joint configurations to 3D body occupancy, achieving fidelity comparable to oracle models. The learned self-model supports critical downstream tasks, including target reaching with an 88.0% success rate (mean best distance 51.3 mm), collision-aware motion planning with a 71.4% success rate (mean final distance 89.3 mm), and human-to-robot motion retargeting with a mean error of 36.1 mm. This approach requires no prior identity labels or kinematic models, suggesting a path for robots to acquire bodily self-representation through experience.
Key takeaway
Robotics engineers building humanoid systems for shared human environments can implement self-other distinction and kinematics-free 3D self-modeling without hand-built kinematic models or identity labels. Because the approach relies only on proprioceptive-visual correspondence, it applies in multi-agent scenes and supports downstream tasks such as target reaching, collision-aware planning, and motion retargeting. Consider this kind of self-supervised learning when a robot must acquire its body representation through experience rather than from a specification.
Key insights
Proprioceptive-visual correspondence enables robots to distinguish self from others and learn a 3D self-model without prior identity labels or kinematic models.
Principles
- Self-other distinction precedes self-modeling.
- Temporal proprioceptive-visual co-occurrence is a sufficient self-supervision signal.
- Self-model fidelity depends directly on distinction accuracy.
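The co-occurrence principle can be illustrated with a deliberately simple toy: the image region whose motion rises and falls with the robot's own joint speed is the best candidate for "self". This is a minimal sketch and not the paper's method, which uses learned contrastive features. The function names, the per-region motion signal, and the sample values are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def pick_self(joint_speed, region_motion):
    """Return the region whose visual motion best tracks the joint speed, plus all scores."""
    scores = {name: pearson(joint_speed, motion)
              for name, motion in region_motion.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy signals: one value per time step.
joint_speed = [0.0, 0.8, 0.1, 0.9, 0.0, 0.7]          # proprioception
region_motion = {
    "agent_a": [0.1, 0.9, 0.2, 1.0, 0.1, 0.8],        # moves in step with the joints
    "agent_b": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],        # moves independently
}
best, scores = pick_self(joint_speed, region_motion)
print(best, scores)
```

A raw correlation like this breaks down when another agent happens to move in sync, which is one reason a learned, higher-capacity matching objective is the more plausible choice in practice.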
Method
The framework has two stages. First, attention-guided contrastive learning separates self from others. Then, pseudo-ground-truth masks supervise a part-aware, pose-conditioned implicit 3D occupancy field via bounded volumetric mask rendering.
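To make the two stages concrete, the sketch below shows one generic loss per stage: an InfoNCE-style contrastive loss over a similarity matrix, and a silhouette value rendered along a single camera ray from sampled densities, compared against a mask pixel with binary cross-entropy. These are standard formulations chosen for illustration; the summary does not specify the paper's exact losses, attention mechanism, or sampling scheme. The `near` and `far` bounds, the `temperature` value, and all function names are assumptions.

```python
import math

def info_nce(sim, temperature=0.1):
    """Mean InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity between proprioceptive sample i and visual
    sample j; the diagonal entries are treated as the matching (positive) pairs.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def render_mask_value(densities, near, far):
    """Silhouette value in [0, 1] for one ray.

    Densities are assumed to be sampled at evenly spaced depths between the
    near and far bounds; the mask value is 1 minus the accumulated transmittance.
    """
    delta = (far - near) / len(densities)
    transmittance = 1.0
    for sigma in densities:
        transmittance *= math.exp(-max(sigma, 0.0) * delta)
    return 1.0 - transmittance

def mask_bce(pred, target, eps=1e-6):
    """Binary cross-entropy between a rendered mask value and a mask pixel."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Matched pairs on the diagonal give a lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[0.0, 1.0], [1.0, 0.0]])

# A ray through empty space renders 0; a ray through dense body renders close to 1.
empty_ray = render_mask_value([0.0, 0.0, 0.0, 0.0], near=0.5, far=2.0)
body_ray = render_mask_value([50.0, 50.0, 50.0, 50.0], near=0.5, far=2.0)
print(aligned, shuffled, empty_ray, body_ray)
```

Restricting samples to a bounded depth interval keeps the per-ray cost fixed. In a full pipeline the densities would come from the pose-conditioned occupancy field rather than from constants.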
In practice
- Guide robot reaching using the density-weighted hand center.
- Integrate learned occupancy for collision-aware RRT planning.
- Retarget human 3D keypoints to robot joint configurations.
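Two of the uses above reduce to small geometric routines once a self-model can be queried. The sketch below computes a density-weighted center from sampled points, then checks whether a straight joint-space edge, as an RRT planner would propose, keeps the predicted body clear of spherical obstacles. The `body_points` callable is a hypothetical stand-in for the learned occupancy model, and the one-joint planar arm, obstacle shapes, and step count are illustrative assumptions rather than details from the paper.

```python
import math

def density_weighted_center(points, densities):
    """Weighted centroid of 3D points, using predicted density as the weight."""
    total = sum(densities)
    if total <= 0:
        raise ValueError("densities must sum to a positive value")
    return tuple(
        sum(w * p[k] for p, w in zip(points, densities)) / total
        for k in range(3)
    )

def interpolate(q_a, q_b, t):
    """Linear interpolation between two joint configurations."""
    return [a + t * (b - a) for a, b in zip(q_a, q_b)]

def edge_is_free(q_a, q_b, body_points, obstacles, steps=10):
    """True if no predicted body point touches a spherical obstacle along the edge.

    body_points(q) stands in for the self-model: it returns the 3D points the
    model marks as occupied for joint configuration q.
    """
    for i in range(steps + 1):
        q = interpolate(q_a, q_b, i / steps)
        for p in body_points(q):
            for center, radius in obstacles:
                if math.dist(p, center) <= radius:
                    return False
    return True

def toy_arm(q):
    """Hypothetical one-joint planar arm with three occupied points along its length."""
    return [(r * math.cos(q[0]), r * math.sin(q[0]), 0.0) for r in (0.1, 0.2, 0.3)]

hand_center = density_weighted_center([(0, 0, 0), (1, 0, 0)], [1.0, 3.0])
obstacles = [((0.0, 0.3, 0.0), 0.05)]  # (center, radius) in meters
short_edge_ok = edge_is_free([0.0], [math.pi / 4], toy_arm, obstacles)
long_edge_ok = edge_is_free([0.0], [math.pi], toy_arm, obstacles)
print(hand_center, short_edge_ok, long_edge_ok)
```

A planner would call `edge_is_free` each time it tries to extend the tree and discard edges that fail. The number of interpolation steps trades checking cost against the risk of stepping over a thin obstacle.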
Topics
- Humanoid Robotics
- Self-other Distinction
- 3D Self-Modeling
- Proprioceptive-Visual Learning
- Robot Motion Planning
- Unitree G1
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.