Macrodata: Robots need a data refinery

· Source: Air Street Press · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Intermediate, short

Summary

Macrodata, a new company founded by Guilherme Penedo and Hynek Kydlíček, has emerged from stealth with $4M in pre-seed funding led by Air Street Capital. The founders, who previously created influential open-source LLM training datasets such as FineWeb at Hugging Face, are now applying their data refinement expertise to robotics. The company targets the need for well-prepared real-world data in the fast-growing physical AI sector, which saw record venture funding in 2025 and is projected to grow further in 2026, with companies like Figure and Skild reaching multi-billion-dollar valuations. Macrodata's first product, Refiner, is an open-source Python library for processing complex, multimodal physical data (video, sensor feeds, and interleaved actions) stored in formats such as LeRobot, HDF5, and Zarr. Refiner streamlines the creation of training-ready datasets by automating tasks like trimming idle motion, annotating subtasks, and scoring trajectories. Macrodata also offers managed cloud compute for processing at scale; in one example, a task completed in under a minute on five H100s for approximately $0.27.

Key takeaway

For robotics engineers and ML teams building physical AI: if you are struggling with disparate robot data formats and manual preprocessing, Macrodata's Refiner offers an open-source option. Your team can adopt the Python library to standardize multimodal data, automate refinement tasks such as trimming and annotation, and use managed cloud compute to prepare datasets at scale. Cleaner, consistently prepared data should shorten the path to model training and may improve policy quality.

Key insights

Refined, high-quality real-world data is the next scaling frontier for physical AI and robotics.

Method

Compose data processing pipelines locally, inspect with a multimodal viewer, then execute on managed cloud compute for scale.
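
The "compose locally" part of this method can be pictured with a small sketch. The code below is not Refiner's API, which the article does not show; every name in it (Episode, trim_idle, score_smoothness, keep_if, run_pipeline) is hypothetical, and the smoothness score is an assumed stand-in for whatever trajectory scoring a real pipeline would use. It only illustrates the idea of chaining small refinement steps over robot episodes, where each step may transform an episode or drop it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Episode:
    """One robot trajectory: per-step action vectors plus an optional score."""
    actions: List[List[float]]
    score: Optional[float] = None


# A step returns a transformed episode, or None to drop the episode.
Step = Callable[[Episode], Optional[Episode]]


def trim_idle(threshold: float = 1e-3) -> Step:
    """Cut leading and trailing steps where no action component exceeds threshold."""
    def step(ep: Episode) -> Optional[Episode]:
        moving = [i for i, a in enumerate(ep.actions)
                  if any(abs(x) > threshold for x in a)]
        if not moving:
            return None  # the robot never moved: drop the episode
        return Episode(ep.actions[moving[0]:moving[-1] + 1], ep.score)
    return step


def score_smoothness() -> Step:
    """Assumed toy score: 1 / (1 + mean absolute change between consecutive actions)."""
    def step(ep: Episode) -> Optional[Episode]:
        diffs = [abs(b - a)
                 for prev, cur in zip(ep.actions, ep.actions[1:])
                 for a, b in zip(prev, cur)]
        mean_change = sum(diffs) / len(diffs) if diffs else 0.0
        return Episode(ep.actions, 1.0 / (1.0 + mean_change))
    return step


def keep_if(min_score: float) -> Step:
    """Keep only episodes that have been scored at or above min_score."""
    def step(ep: Episode) -> Optional[Episode]:
        return ep if ep.score is not None and ep.score >= min_score else None
    return step


def run_pipeline(episodes: Iterable[Episode], steps: List[Step]) -> List[Episode]:
    """Apply each step in order; stop early for an episode once a step drops it."""
    refined = []
    for ep in episodes:
        current: Optional[Episode] = ep
        for s in steps:
            current = s(current)
            if current is None:
                break
        if current is not None:
            refined.append(current)
    return refined


raw = [
    Episode([[0.0], [0.0], [0.2], [0.3], [0.0]]),  # idle at the start and end
    Episode([[0.0], [0.0]]),                       # idle throughout
]
refined = run_pipeline(raw, [trim_idle(), score_smoothness(), keep_if(0.5)])
print(len(refined), refined[0].actions)
```

Because each step is a plain function over one episode, the same list of steps could in principle be run on a laptop over a handful of episodes for inspection and then mapped over a full dataset on remote workers, which is the local-then-cloud workflow described above.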

Best for: AI Engineer, Computer Vision Engineer, Investor, Robotics Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Air Street Press.