ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

· Source: Machine Learning · Field: Technology & Digital — Robotics & Autonomous Systems, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ROVE is a reinforcement learning framework that improves humanoid Vision-Language-Action (VLA) models by learning from imperfect human interventions. Seamless intervention on a humanoid is a significant systems challenge, and because whole-body kinematics and dexterous-hand control are complex, the resulting corrective signals are often suboptimal. ROVE introduces a human-in-the-loop pipeline that collects both deployment and intervention data, then applies Optimistic Value Estimation (OVE) to identify and prioritize high-value behaviors within mixed-quality trajectories, so the VLA actor does not indiscriminately imitate every action. ROVE also uses cross-embodiment human experience videos to strengthen value estimation, particularly for long-tailed failure and recovery scenarios. The result is an informative critic that guides the VLA actor toward high-value behaviors. On challenging real-world humanoid manipulation tasks that are contact-rich and fine-grained, ROVE consistently outperforms experience-learning baselines and shows improvement across multiple rollout-intervention iterations.

Key takeaway

For robotics engineers building humanoid manipulation systems that incorporate human interventions, ROVE offers a way to learn from imperfect intervention data. Consider using Optimistic Value Estimation (OVE) to identify and prioritize high-value behaviors in mixed-quality trajectories rather than imitating every action. Adding cross-embodiment human experience videos can further improve value estimation for long-tailed failure and recovery cases, which in turn helps the VLA actor perform more reliably on real-world contact-rich tasks.

Key insights

ROVE improves humanoid VLA manipulation by using Optimistic Value Estimation and cross-embodiment videos to learn from imperfect human interventions.

Method

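ROVE collects deployment and intervention data through a human-in-the-loop pipeline, then applies Optimistic Value Estimation (OVE) to identify and prioritize high-value behaviors in the mixed-quality trajectories. Cross-embodiment human experience videos make the value estimates more robust, particularly for long-tailed failure and recovery scenarios.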
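
This summary does not state ROVE's actual OVE objective, so the following is only a rough illustration of the general idea of optimistic value estimation, not the authors' method. It swaps in expectile regression (the asymmetric loss used in Implicit Q-Learning) as a stand-in: with an expectile level above 0.5, the value estimate is pulled toward the better outcomes in mixed-quality data, and exponentiated advantages then weight which samples an actor would imitate most. All function names, parameters (`tau`, `beta`, `max_weight`), and the return values are hypothetical.

```python
import numpy as np


def fit_expectile(returns, tau, lr=0.1, steps=2000):
    """Fit a scalar value v that minimizes the expectile loss
    mean(w * (r - v)^2), where w = tau if r > v else (1 - tau).
    tau = 0.5 recovers the mean; tau > 0.5 gives an optimistic estimate."""
    returns = np.asarray(returns, dtype=float)
    v = float(returns.mean())
    for _ in range(steps):
        diff = returns - v
        weight = np.where(diff > 0, tau, 1.0 - tau)
        grad = -2.0 * np.mean(weight * diff)  # d(loss)/dv
        v -= lr * grad
    return v


def advantage_weights(returns, v, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage imitation weights, clipped for stability.
    Samples whose return exceeds the value estimate get larger weights."""
    returns = np.asarray(returns, dtype=float)
    return np.minimum(np.exp((returns - v) / beta), max_weight)


# Hypothetical returns observed from one state: several failed rollouts
# plus two successful human interventions (illustrative numbers only).
returns = np.array([0.0, 0.0, 0.1, 0.2, 1.0, 0.9])

v_mean = fit_expectile(returns, tau=0.5)  # plain average of the data
v_opt = fit_expectile(returns, tau=0.9)   # optimistic estimate
weights = advantage_weights(returns, v_opt, beta=0.25)

print("mean value:", round(v_mean, 3))
print("optimistic value:", round(v_opt, 3))
print("imitation weights:", np.round(weights, 3))
```

In this toy setting the optimistic estimate sits above the plain average but below the best observed return, and the successful interventions receive far larger imitation weights than the failed rollouts. That is the qualitative behavior the summary attributes to the critic: steering the actor toward high-value behaviors instead of copying everything.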

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.