Collecting robot training data is dirty, unglamorous work. Some AI labs are already paying XDOF to do it.
Summary
XDOF, a new startup, has raised $70 million from Thrive Capital, Spark Capital, a16z, Lux, and WndrCo to address the critical data bottleneck in robotics, a field seeing renewed interest from major AI labs like OpenAI. Unlike language models, robots require vast amounts of physical interaction training data, which is currently scarce. XDOF aims to provide the data pipelines, collection tools, and annotation systems that frontier labs and robotics companies need, and it already serves 20 customers. Co-founder Philipp Wu ran into this data scarcity firsthand, an experience that led to the development of GELLO, a teleoperation system for data generation. The company is also partnering with UC Berkeley's AI Research lab to release ABC, a dataset comprising 130,000 trajectories, 300 hours of simulation, and 100 hours of evaluations. XDOF plans a three-tiered data collection strategy that includes teleoperation and egocentric data, and it will manage the labor-intensive process of hiring and training operators, a task AI labs prefer to outsource.
Key takeaway
If you are a Director of AI/ML or a robotics engineer building physical AI systems, recognize that high-quality, large-scale training data is now a critical bottleneck alongside model architecture and compute. Evaluate specialized data infrastructure providers like XDOF to accelerate your robot development. Outsourcing the labor-intensive work of data collection and annotation lets your teams focus on core model training and deployment, and reduces the risk of falling behind as physical AI advances.
Key insights
Robotics AI development is bottlenecked by scarce physical interaction data, driving demand for specialized data collection and annotation infrastructure.
Principles
- Physical interaction data is crucial for robotics.
- Hardware design directly affects data quality.
- Open data releases foster community innovation.
Method
XDOF builds data pipelines, collection tools, and annotation systems. Its collection strategy is a three-tier pyramid: teleoperation on deployment robots, general teleoperated data, and egocentric data from wearable sensors, all supported by trained operators.
In practice
- Employ teleoperation for robot data generation.
- Outsource large-scale robotics data collection.
- Contribute to open robotics datasets.
Topics
- Robotics Data
- AI Training Data
- Teleoperation
- Foundation Models
- Data Annotation
- XDOF
Best for: Investor, CTO, VP of Engineering/Data, AI Scientist, Robotics Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by AI News & Artificial Intelligence | TechCrunch.