ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
Summary
ABot-M0 is a framework for building general-purpose embodied agents for robotic manipulation across diverse hardware, addressing the "one-brain, many-forms" challenge. It systematically curates and standardizes six public datasets into the UniACT-dataset, which comprises over 6 million trajectories and 9,500 hours of data from 20+ robot embodiments. The framework introduces Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. The premise is that effective actions lie on low-dimensional, smooth manifolds, so projecting predictions onto them improves the efficiency and stability of action prediction. ABot-M0 also features a modular dual-stream perception mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play modules such as VGGT and Qwen-Image-Edit. Experimental results show that ABot-M0 achieves state-of-the-art performance on LIBERO (98.6% success rate), LIBERO-Plus (80.5%), RoboCasa GR1 Tabletop Tasks (58.3%), and RoboTwin 2.0 (over 80%), outperforming baselines such as $\pi_{0.5}$ and UniVLA.
Key takeaway
For AI Scientists and Research Scientists developing generalist robot policies, ABot-M0's combination of data unification and Action Manifold Learning offers a concrete blueprint. Consider standardizing actions as deltas in the end-effector frame, with rotations encoded as rotation vectors; this representation reportedly improves cross-embodiment generalization and policy stability. Integrating modular 3D perception can also compensate for the limited spatial reasoning of VLMs, which supports more precise manipulation in complex environments.
Key insights
Unified data, action manifold learning, and modular 3D perception enable generalizable robotic manipulation across diverse hardware.
Principles
- Effective robot actions reside on low-dimensional, smooth manifolds.
- Data scale, quality, and diversity are crucial for general-purpose VLA models.
- Systematic data engineering can achieve high-performance embodied intelligence.
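The first principle is what motivates predicting clean actions directly rather than noise or velocity. A minimal sketch of what such an objective could look like is given here. The linear interpolation schedule, the squared-error loss, and all symbols ($a$ for the clean action chunk, $\epsilon$ for Gaussian noise, $t$ for the noise level, $c$ for the perception and language conditioning, $f_\theta$ for the DiT) are assumptions made for illustration, not the paper's stated formulation.

$$
a_t = t\,a + (1 - t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1]
$$

$$
\mathcal{L}_{\text{action}}(\theta) = \mathbb{E}_{a,\,\epsilon,\,t}\,\bigl\lVert f_\theta(a_t, t, c) - a \bigr\rVert^2
$$

Under this sketch the regression target is always the clean action $a$, which by the first principle lies on a low-dimensional manifold. A velocity-prediction model would instead regress $a - \epsilon$, a target that inherits the full dimensionality of the noise.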
Method
ABot-M0 unifies heterogeneous robotic datasets, standardizes actions to delta end-effector positions with rotation vectors, and employs Action Manifold Learning (AML) with a DiT backbone to directly predict clean action sequences, complemented by a dual-stream perception architecture.
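The action standardization step can be sketched as follows. This is an illustrative NumPy example, not the authors' code: the function names, the 6-D output layout (translation followed by rotation vector), and the omission of the gripper channel are all assumptions. It expresses the motion between two consecutive absolute end-effector poses in the end-effector frame at the earlier step.

```python
import numpy as np


def rotvec_to_matrix(rotvec):
    """Rodrigues' formula: rotation vector (axis * angle) -> 3x3 rotation matrix."""
    rotvec = np.asarray(rotvec, dtype=float)
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)
    k = rotvec / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def matrix_to_rotvec(R):
    """3x3 rotation matrix -> rotation vector.

    Assumes the rotation angle is well below pi, which holds for the small
    per-step deltas this sketch targets.
    """
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    if angle < 1e-12:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(angle))
    return axis * angle


def absolute_to_delta(pos_t, rotvec_t, pos_next, rotvec_next):
    """Hypothetical helper: delta action in the end-effector frame at time t.

    Returns a 6-vector [dx, dy, dz, rx, ry, rz] (assumed layout).
    """
    R_t = rotvec_to_matrix(rotvec_t)
    R_next = rotvec_to_matrix(rotvec_next)
    # Translation: world-frame displacement rotated into the current EE frame.
    d_pos = R_t.T @ (np.asarray(pos_next, dtype=float) - np.asarray(pos_t, dtype=float))
    # Rotation: relative rotation from the current to the next orientation.
    d_rot = matrix_to_rotvec(R_t.T @ R_next)
    return np.concatenate([d_pos, d_rot])
```

For example, if the end effector is yawed 90 degrees about z and moves one unit along world x while yawing a further 0.1 rad, the delta in its own frame is a one-unit move along its negative y axis plus a 0.1 rad rotation about z. Because the result depends only on relative motion, the same physical movement yields the same action regardless of where the robot base sits, which is one plausible reason the representation transfers across embodiments.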
In practice
- Convert absolute actions to relative (delta) actions for training efficiency.
- Encode rotations using rotation vectors for greater stability.
- Use multi-granularity uniform sampling to balance task and embodiment distributions.
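The last bullet can be illustrated with a small sampler. The summary does not spell out how multi-granularity uniform sampling works, so the sketch below assumes one plausible reading: draw uniformly at each level of a hierarchy (embodiment, then task, then trajectory), so that large sources do not dominate training batches. The dictionary keys and function names are hypothetical.

```python
import random
from collections import defaultdict


def build_index(trajectories):
    """Group trajectory ids by embodiment, then by task.

    Each trajectory is assumed to be a dict with "id", "embodiment",
    and "task" keys (a hypothetical schema).
    """
    index = defaultdict(lambda: defaultdict(list))
    for traj in trajectories:
        index[traj["embodiment"]][traj["task"]].append(traj["id"])
    return index


def sample_trajectory(index, rng):
    """Draw one trajectory id, uniform at every level of the hierarchy."""
    embodiment = rng.choice(sorted(index))            # uniform over embodiments
    task = rng.choice(sorted(index[embodiment]))      # uniform over that embodiment's tasks
    return rng.choice(index[embodiment][task])        # uniform over matching trajectories
```

With 98 trajectories from one embodiment and 2 from another, plain uniform sampling over trajectories would pick the rare embodiment about 2% of the time, whereas this sampler picks it about half the time. The trade-off is that trajectories from small groups are repeated far more often, so in practice the balance between levels may need tuning.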
Topics
- Robotic Manipulation
- Vision-Language-Action Models
- Action Manifold Learning
- Foundation Models
- Multi-Embodiment Learning
Best for: AI Scientist, Research Scientist, AI Researcher, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.