ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

Source: cs.CV updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics) · Depth: Expert, extended

Summary

ABot-M0 is a framework for building general-purpose embodied agents for robotic manipulation across diverse hardware, addressing the "one-brain, many-forms" challenge. It curates and standardizes six public datasets into the UniACT-dataset, which comprises over 6 million trajectories and 9,500 hours of data from more than 20 robot embodiments. The framework introduces Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly; projecting actions onto low-dimensional, smooth manifolds in this way improves the efficiency and stability of action prediction. ABot-M0 also features a modular dual-stream perception mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit. In the reported experiments, ABot-M0 achieves state-of-the-art performance on LIBERO (98.6% success rate), LIBERO-Plus (80.5%), RoboCasa GR1 Tabletop Tasks (58.3%), and RoboTwin 2.0 (over 80%), outperforming baselines such as $\pi_{0.5}$ and UniVLA.

Key takeaway

For AI Scientists and Research Scientists developing generalist robot policies, ABot-M0's approach to data unification and Action Manifold Learning offers a reusable blueprint. Consider standardizing actions as deltas in the end-effector frame, with rotations expressed as rotation vectors; this is reported to improve cross-embodiment generalization and policy stability. Integrating modular 3D perception can also compensate for the limited spatial reasoning of VLMs, enabling more precise manipulation in complex environments.
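To make the action representation concrete, here is a minimal sketch of one common way to turn two consecutive absolute end-effector poses into a 6-D delta action: translation expressed in the current end-effector frame, plus the relative rotation as a rotation vector. The function names (`delta_action`, `rotmat_to_rotvec`, `rot_z`) and the exact frame convention are assumptions chosen for illustration; the summary does not spell out ABot-M0's conversion, and a gripper dimension is omitted.

```python
import math


def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def rotmat_to_rotvec(r, eps=1e-8):
    """Log map of a rotation matrix: axis * angle as a 3-vector.

    This closed form is fine for the small per-step rotations typical of
    delta actions; it is numerically unreliable for angles close to pi.
    """
    cos_angle = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    if angle < eps:
        return [0.0, 0.0, 0.0]
    scale = angle / (2.0 * math.sin(angle))
    return [scale * (r[2][1] - r[1][2]),
            scale * (r[0][2] - r[2][0]),
            scale * (r[1][0] - r[0][1])]


def delta_action(rot_t, pos_t, rot_next, pos_next):
    """Hypothetical helper: 6-D delta action [dx, dy, dz, rx, ry, rz].

    Poses are given in the robot base frame; the result is expressed in
    the end-effector frame at time t (an assumed convention).
    """
    rot_t_inv = transpose(rot_t)  # inverse of a rotation is its transpose
    base_step = [b - a for a, b in zip(pos_t, pos_next)]
    d_pos = matvec(rot_t_inv, base_step)
    d_rot = rotmat_to_rotvec(matmul(rot_t_inv, rot_next))
    return d_pos + d_rot


def rot_z(theta):
    """Rotation by theta radians about the z axis (used for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    action = delta_action(rot_z(math.pi / 2), [0.0, 0.0, 0.0],
                          rot_z(math.pi / 2 + 0.1), [0.0, 1.0, 0.0])
    print([round(x, 6) for x in action])
```

Because the delta is expressed relative to the gripper rather than the robot base, the same motion yields the same action vector regardless of where the arm is mounted, which is one plausible reason this representation transfers across embodiments.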

Key insights

Unified data, action manifold learning, and modular 3D perception enable generalizable robotic manipulation across diverse hardware.

Method

ABot-M0 unifies heterogeneous robotic datasets and standardizes their actions as delta end-effector positions with rotation vectors. It then applies Action Manifold Learning (AML), using a DiT backbone to predict clean action sequences directly, and complements this with a dual-stream perception architecture.
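The idea of predicting clean actions directly, rather than noise or velocity, can be illustrated with a small sketch. It assumes a linear interpolation path between noise and the clean action (a flow-matching-style setup); the summary does not state the paper's actual objective or sampler, so the path, the function names, and the Euler sampler below are all illustrative assumptions. The DiT network is replaced by a placeholder callable.

```python
def noisy_actions(clean, noise, t):
    """Assumed linear path: x_t = t * clean + (1 - t) * noise, t in [0, 1)."""
    return [t * a + (1.0 - t) * e for a, e in zip(clean, noise)]


def velocity_from_clean_prediction(pred_clean, x_t, t, min_gap=1e-3):
    """Velocity implied by a clean-action prediction on the linear path.

    On this path the true velocity is (clean - noise), which equals
    (clean - x_t) / (1 - t). The gap is clamped to avoid dividing by a
    value near zero as t approaches 1.
    """
    gap = max(1.0 - t, min_gap)
    return [(a - x) / gap for a, x in zip(pred_clean, x_t)]


def clean_prediction_loss(pred_clean, clean):
    """Mean squared error between predicted and ground-truth clean actions."""
    return sum((p - a) ** 2 for p, a in zip(pred_clean, clean)) / len(clean)


def sample(predict_clean, noise, steps):
    """Euler integration from noise (t=0) towards an action (t=1).

    `predict_clean(x_t, t)` stands in for the network; here it is any
    callable that returns a clean-action estimate.
    """
    x = list(noise)
    for i in range(steps):
        t = i / steps
        v = velocity_from_clean_prediction(predict_clean(x, t), x, t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    clean = [0.5, -0.2]
    noise = [1.0, 2.0]
    # An oracle predictor that always returns the true clean action.
    result = sample(lambda x, t: clean, noise, steps=10)
    print([round(v, 6) for v in result])
```

With an oracle predictor, the sampler lands on the clean action at the final step, which shows that a clean-action output carries the same information as a velocity output on this path. The practical difference is the prediction target: the network regresses onto structured actions instead of unstructured noise.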

Best for: AI Scientist, Research Scientist, AI Researcher, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.