Macrodata: Robots need a data refinery
Summary
Macrodata, a new company founded by Guilherme Penedo and Hynek Kydlíček, has emerged from stealth with $4M in pre-seed funding led by Air Street Capital. The founders, who previously created influential open-source LLM training datasets such as FineWeb at Hugging Face, are now applying their data refinement expertise to robotics. The company targets the need for well-prepared real-world data in the rapidly expanding physical AI sector, which saw record venture funding in 2025 and is projected to grow further in 2026, with companies like Figure and Skild reaching multi-billion-dollar valuations. Macrodata's initial product, Refiner, is an open-source Python library for processing complex, multimodal physical data (video, sensor feeds, and interleaved actions) stored in formats such as LeRobot, HDF5, and Zarr. Refiner streamlines the creation of training-ready datasets by automating tasks like trimming idle motion, annotating subtasks, and scoring trajectories. Macrodata also offers managed cloud compute for scalable processing: one example task completed in under a minute on five H100s for approximately $0.27.
Key takeaway
For robotics engineers and ML teams developing physical AI who are struggling with disparate robot data formats and manual preprocessing, Macrodata's Refiner offers an open-source option. Your team can adopt this Python library to standardize multimodal data, automate refinement tasks such as trimming and annotation, and use managed cloud compute for scalable dataset preparation. The intended payoff is faster model training and better policy quality.
Key insights
Refined, high-quality real-world data is the next scaling frontier for physical AI and robotics.
Principles
- Data refinement is a critical, transferable skill for AI progress.
- Physical AI demands specialized tools for multimodal data processing.
- Open-source data tooling can standardize diverse robotics formats.
Method
Compose data processing pipelines locally, inspect with a multimodal viewer, then execute on managed cloud compute for scale.
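As a rough illustration of what composing a pipeline locally can look like, here is a minimal sketch in plain Python. It does not use Refiner's actual API, which this summary does not describe. The names `compose`, `drop_short`, and `add_length`, and the episode dictionary layout, are all hypothetical.

```python
from functools import reduce


def compose(*steps):
    """Chain steps left to right; each step maps an iterable of episodes to another."""
    return lambda episodes: reduce(lambda data, step: step(data), steps, episodes)


def drop_short(min_frames):
    # Hypothetical step: discard episodes with fewer than `min_frames` frames.
    return lambda episodes: (e for e in episodes if len(e["frames"]) >= min_frames)


def add_length(episodes):
    # Hypothetical step: record the frame count on each episode.
    return ({**e, "num_frames": len(e["frames"])} for e in episodes)


# Build the pipeline once, then run it on a small local sample for inspection.
pipeline = compose(drop_short(min_frames=3), add_length)

sample = [
    {"id": "a", "frames": [0, 1, 2, 3]},
    {"id": "b", "frames": [0]},
]
result = list(pipeline(sample))
print(result)
```

Because each step is a plain function over an iterable, the same pipeline object could in principle be run on a small local sample first and handed to a larger executor later. How Refiner actually does this is not specified here.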
In practice
- Unify robot data from LeRobot, HDF5, Zarr, and MCAP formats.
- Trim idle motion and annotate subtasks in robot trajectories.
- Score trajectory quality using reward models or VLMs.
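To make the idle-motion trimming idea concrete, here is a minimal sketch in plain Python. It is not Refiner's implementation. It assumes an episode is a list of per-timestep action vectors and treats a frame as idle when its action norm is at or below a threshold. The function name `trim_idle` and the default threshold are hypothetical choices for illustration.

```python
from math import sqrt


def trim_idle(actions, threshold=1e-3):
    """Return (start, end) slice bounds that drop idle frames at both ends.

    `actions` is a list of per-timestep action vectors (lists of floats).
    A frame is idle when the Euclidean norm of its action is <= `threshold`.
    """
    def is_active(action):
        return sqrt(sum(x * x for x in action)) > threshold

    active = [i for i, a in enumerate(actions) if is_active(a)]
    if not active:
        return 0, 0  # the whole episode is idle, so nothing is kept
    return active[0], active[-1] + 1


episode = [
    [0.0, 0.0],   # idle lead-in
    [0.0, 0.0],
    [0.2, -0.1],  # first active frame
    [0.0, 0.0],   # brief mid-episode pause, kept on purpose
    [0.3, 0.0],   # last active frame
    [0.0, 0.0],   # idle tail
]
start, end = trim_idle(episode)
trimmed = episode[start:end]
print(start, end, len(trimmed))  # prints "2 5 3"
```

This sketch trims only the leading and trailing idle frames and leaves mid-episode pauses in place, since a pause can be part of the task. A real pipeline might also look at joint velocities or gripper state rather than commanded actions alone.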
Topics
- Macrodata
- Refiner
- Robotics Data
- Physical AI
- Data Refinement
- Open-source ML
Best for: AI Engineer, Computer Vision Engineer, Investor, Robotics Engineer, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Air Street Press.