HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The HumanScale study, published on 2026-06-18, compares egocentric human video with teleoperated real-robot trajectories as pretraining data for embodied foundation models. Real-robot data offers precise action supervision, but high collection costs and low diversity limit how far it scales. The study finds that egocentric human video, processed through a carefully designed filtering and labeling pipeline, is not just a viable substitute but can yield better results. Models pretrained on egocentric video achieved 24% lower validation loss on real-robot action prediction. Using the same amount of pretraining data, they also reached 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This suggests a scalable paradigm: pretrain on diverse egocentric human video, then adapt with a small amount of labeled real-robot data.

Key takeaway

Machine learning engineers building embodied foundation models should reconsider their pretraining data strategy. If you currently rely on costly teleoperated real-robot data, evaluate egocentric human video as the pretraining source. Combined with careful filtering and labeling, it delivered substantially better results in this study, including 90% higher out-of-distribution task success. Prioritize pipelines for collecting and processing egocentric data to learn scalable, diverse world representations, then adapt with a small amount of real-robot data.

Key insights

With careful filtering and labeling, egocentric human video can outperform real-robot data for pretraining embodied foundation models, offering a more scalable alternative.

Method

Pretrain the embodied foundation model on filtered and labeled egocentric human video to learn diverse world representations. Then adapt it with a small amount of labeled real-robot data for precise action-space alignment.
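
The two-stage recipe can be illustrated with a toy sketch. This is not the study's implementation: a linear model stands in for the foundation model, and synthetic arrays stand in for both datasets. Every name here (`fit_linear`, `X_human`, `X_robot`, and so on) is hypothetical, as is the assumption that the robot's observation-to-action mapping is a small perturbation of the one implied by the labeled human data. The sketch only shows the mechanics of pretraining on a large labeled set and then warm-starting a short adaptation on a small one.

```python
import numpy as np

def fit_linear(X, Y, W=None, lr=0.05, steps=300):
    """Gradient descent on mean squared error for Y ~ X @ W.
    Passing W warm-starts from existing weights instead of zeros."""
    W = np.zeros((X.shape[1], Y.shape[1])) if W is None else W.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ W - Y) / len(X)
        W -= lr * grad
    return W

def mse(X, Y, W):
    """Mean squared prediction error of the linear model W."""
    return float(np.mean((X @ W - Y) ** 2))

rng = np.random.default_rng(0)
d_obs, d_act = 8, 3  # toy observation and action dimensions (assumed)

# Stage 1 data: large, diverse set with pipeline-produced action labels
# (synthetic stand-in for filtered and labeled egocentric human video).
W_human = rng.normal(size=(d_obs, d_act))
X_human = rng.normal(size=(2000, d_obs))
Y_human = X_human @ W_human + 0.1 * rng.normal(size=(2000, d_act))

# Stage 2 data: small labeled real-robot set. Assumption: the robot mapping
# is a small perturbation of the human-derived mapping.
W_robot = W_human + 0.1 * rng.normal(size=(d_obs, d_act))
X_robot = rng.normal(size=(20, d_obs))
Y_robot = X_robot @ W_robot
X_val = rng.normal(size=(200, d_obs))
Y_val = X_val @ W_robot

# Pretrain on the large set, then adapt briefly on the small robot set.
W_pre = fit_linear(X_human, Y_human)
W_adapted = fit_linear(X_robot, Y_robot, W=W_pre, steps=20)

# Baseline: same small robot set and step budget, but no pretraining.
W_scratch = fit_linear(X_robot, Y_robot, steps=20)

adapted_loss = mse(X_val, Y_val, W_adapted)
scratch_loss = mse(X_val, Y_val, W_scratch)
print(f"robot validation loss, pretrain then adapt: {adapted_loss:.4f}")
print(f"robot validation loss, robot data only:     {scratch_loss:.4f}")
```

In this toy setting the warm-started model starts close to the robot mapping, so a few adaptation steps on 20 samples are enough. Whether that carries over to real embodied models depends on how well the labeled human data matches the robot's action space, which is what the study's filtering and labeling pipeline is meant to address.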

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.