HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
Summary
The paper compares egocentric human video with teleoperated real-robot trajectories as pretraining data for embodied foundation models. The authors find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, can yield better performance than real-robot data. Using the same amount of pretraining data, models pretrained on egocentric data achieved a 24% lower validation loss on real-robot action prediction, and 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. The authors take this finding as verification of a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment.
Key takeaway
AI scientists and machine learning engineers developing embodied foundation models should consider prioritizing egocentric human video for pretraining. The paper presents it as a scalable, cost-effective alternative to real-robot data that gave better action prediction and task execution at the same pretraining data volume. In practice, this means putting a robust filtering and labeling pipeline in front of the egocentric data, then using a small amount of real-robot data solely for action-space alignment. The expected payoff is better generalization and lower data collection costs.
Key insights
Egocentric human video, properly processed, can outperform real-robot data for embodied foundation model pretraining.
Principles
- Egocentric human video offers scalable, diverse pretraining data.
- Real-robot data collection is costly and diversity-limited.
- Pretraining on diverse data improves generalization.
Method
Pretrain embodied foundation models on egocentric human video that has passed through a carefully designed filtering and labeling pipeline, then fine-tune with a small amount of labeled real-robot data for action-space alignment.
In practice
- Filter and label egocentric video for quality.
- Use egocentric data for initial model pretraining.
- Adapt with minimal real-robot data for alignment.
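The three steps above can be sketched as a small data-selection routine. This is a minimal illustration, not the authors' pipeline: the `Clip` fields, the `quality` score, the `min_quality` threshold, and the `robot_cap` ratio are all hypothetical names and values chosen for the example, since the summary does not describe the actual filtering criteria or how much robot data counts as "small".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Clip:
    clip_id: str
    source: str                   # "egocentric" or "robot"
    quality: float                # hypothetical quality score in [0, 1]
    label: Optional[str] = None   # hypothetical action label; None if unlabeled


def keep_usable_clips(clips: List[Clip], min_quality: float = 0.5) -> List[Clip]:
    """Step 1: keep egocentric clips that pass the quality bar and carry a label."""
    return [
        c for c in clips
        if c.source == "egocentric"
        and c.quality >= min_quality
        and c.label is not None
    ]


def build_schedule(
    clips: List[Clip],
    min_quality: float = 0.5,
    robot_cap: float = 0.1,       # assumed ratio of robot clips to pretraining clips
) -> Dict[str, List[Clip]]:
    """Split clips into a pretraining stage and a small adaptation stage."""
    # Step 2: the pretraining set is the filtered egocentric video.
    pretrain = keep_usable_clips(clips, min_quality)
    # Step 3: the adaptation set is a capped slice of labeled robot data.
    robot = [c for c in clips if c.source == "robot" and c.label is not None]
    max_robot = int(len(pretrain) * robot_cap)
    return {"pretrain": pretrain, "adapt": robot[:max_robot]}


if __name__ == "__main__":
    demo = [Clip(f"ego{i}", "egocentric", i / 10, "pick") for i in range(10)]
    demo += [Clip(f"bot{i}", "robot", 1.0, "pick") for i in range(5)]
    stages = build_schedule(demo, min_quality=0.5, robot_cap=0.5)
    print(len(stages["pretrain"]), len(stages["adapt"]))
```

Keeping the robot set as a capped fraction of the pretraining set is one way to make "minimal real-robot data" explicit in code; a real setup would pick that budget from the adaptation results rather than from a fixed ratio.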
Topics
- Embodied AI
- Foundation Models
- Egocentric Video
- Robot Learning
- Data Pretraining
- Human Demonstrations
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.