ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

2026-02-13 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

ABot-M0 is a framework for building general-purpose embodied agents for robotic manipulation across diverse hardware, the "one-brain, many-forms" challenge. It curates and standardizes six public datasets into UniACT-dataset, which comprises over 6 million trajectories and 9,500 hours of data from 20+ robot embodiments. The framework introduces Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly; projecting actions onto low-dimensional, smooth manifolds is reported to improve the efficiency and stability of action prediction. ABot-M0 also features a modular dual-stream perception mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit. Reported results show state-of-the-art performance on LIBERO (98.6% success rate), LIBERO-Plus (80.5%), RoboCasa GR1 Tabletop Tasks (58.3%), and RoboTwin 2.0 (over 80%), outperforming baselines such as $\pi_{0.5}$ and UniVLA.

Key takeaway

For AI Scientists and Research Scientists developing generalist robot policies, ABot-M0's approach to data unification and Action Manifold Learning offers a practical blueprint. Consider standardizing actions as deltas in the end-effector frame and representing rotations as rotation vectors; the authors report that this improves cross-embodiment generalization and policy stability. Integrating modular 3D perception can also compensate for VLM limitations in spatial reasoning, which supports more precise manipulation in complex environments.

Key insights

Unified data, action manifold learning, and modular 3D perception enable generalizable robotic manipulation across diverse hardware.

Principles

Effective robot actions reside on low-dimensional, smooth manifolds.
Data scale, quality, and diversity are crucial for general-purpose VLA models.
Systematic data engineering can achieve high-performance embodied intelligence.
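
This summary does not state the AML training objective, so the following is only one hedged reading of "directly predict clean action sequences". The symbols are assumptions introduced for illustration: $a$ is a ground-truth action chunk, $\epsilon$ is Gaussian noise, $t$ is an interpolation time, $c$ is the perception conditioning, and $f_\theta$ stands in for the DiT backbone. On a linear path between noise and data, a clean-action objective regresses the network output to $a$ itself, shown next to a velocity-prediction objective on the same path for comparison:

```latex
a_t = t\,a + (1 - t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1]

\mathcal{L}_{\text{clean}}(\theta)
  = \mathbb{E}_{a,\,\epsilon,\,t}\,\bigl\| f_\theta(a_t, t, c) - a \bigr\|_2^2

% for comparison: velocity prediction, whose target is d a_t / dt = a - \epsilon
\mathcal{L}_{\text{vel}}(\theta)
  = \mathbb{E}_{a,\,\epsilon,\,t}\,\bigl\| g_\theta(a_t, t, c) - (a - \epsilon) \bigr\|_2^2
```

Under the first principle above, the target $a$ lies on a low-dimensional, smooth manifold, whereas $\epsilon$ and $a - \epsilon$ spread over the full ambient action space. That is one intuition for why a clean-action target may be easier to fit; the paper's actual formulation may differ.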

Method

ABot-M0 unifies heterogeneous robotic datasets, standardizes actions to delta end-effector positions with rotation vectors, and employs Action Manifold Learning (AML) with a DiT backbone to directly predict clean action sequences, complemented by a dual-stream perception architecture.
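
The action standardization described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: poses are assumed to be given as a 3x3 rotation matrix plus a 3-vector position in the world frame, the gripper dimension is omitted, and the function names (`rot_z`, `rotmat_to_rotvec`, `delta_action_ee_frame`) are hypothetical.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for a rotation of `theta` radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotmat_to_rotvec(R):
    """Map a rotation matrix to a rotation vector (unit axis times angle)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    # This formula is ill-conditioned near theta == pi; per-step deltas
    # between consecutive frames are assumed to be small rotations.
    axis = np.array([
        R[2, 1] - R[1, 2],
        R[0, 2] - R[2, 0],
        R[1, 0] - R[0, 1],
    ]) / (2.0 * np.sin(theta))
    return axis * theta


def delta_action_ee_frame(R_t, p_t, R_next, p_next):
    """Convert two absolute poses into a 6-D delta action.

    The translation and rotation deltas are both expressed in the current
    end-effector frame, so the result does not depend on where the robot
    base sits in the world.
    """
    dp = R_t.T @ (p_next - p_t)   # translation delta in the end-effector frame
    dR = R_t.T @ R_next           # relative rotation in the end-effector frame
    return np.concatenate([dp, rotmat_to_rotvec(dR)])


if __name__ == "__main__":
    R_t, p_t = rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0])
    R_next, p_next = R_t @ rot_z(0.1), p_t + np.array([0.0, 0.05, 0.0])
    print(delta_action_ee_frame(R_t, p_t, R_next, p_next))
```

A rotation vector uses three numbers with no unit-norm or orthogonality constraint to maintain, which is one plausible reason it behaves more stably as a regression target than quaternions or Euler angles.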

In practice

Convert absolute actions to relative (delta) actions for training efficiency.
Encode rotations using rotation vectors for greater stability.
Use multi-granularity uniform sampling to balance task and embodiment distributions.
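
The sampling recommendation above can be illustrated with a small sketch. The summary does not specify how the multi-granularity sampler is built, so this assumes one simple reading: sample uniformly over embodiments, then uniformly over tasks within that embodiment, then uniformly over trajectories. The record fields (`id`, `embodiment`, `task`) and function names are hypothetical.

```python
import random
from collections import defaultdict


def build_index(trajectories):
    """Group trajectory ids as index[embodiment][task] -> list of ids."""
    index = defaultdict(lambda: defaultdict(list))
    for traj in trajectories:
        index[traj["embodiment"]][traj["task"]].append(traj["id"])
    return index


def sample_trajectory(index, rng):
    """Draw one trajectory id, uniform at each level of the hierarchy."""
    embodiment = rng.choice(sorted(index))        # level 1: embodiment
    task = rng.choice(sorted(index[embodiment]))  # level 2: task
    return rng.choice(index[embodiment][task])    # level 3: trajectory


if __name__ == "__main__":
    # Toy data: embodiment "arm_a" has 98 trajectories, "arm_b" has only 2.
    data = [{"id": i, "embodiment": "arm_a", "task": "pick"} for i in range(98)]
    data += [{"id": 98 + i, "embodiment": "arm_b", "task": "place"} for i in range(2)]
    index = build_index(data)
    rng = random.Random(0)
    draws = [sample_trajectory(index, rng) for _ in range(2000)]
    share_b = sum(d >= 98 for d in draws) / len(draws)
    print(f"share of draws from arm_b: {share_b:.2f}")
```

With plain uniform sampling over trajectories, the rare embodiment in the toy data would appear in about 2% of draws; the hierarchical scheme raises that to about half, at the cost of repeating its few trajectories more often.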

Topics

Robotic Manipulation
Vision-Language-Action Models
Action Manifold Learning
Foundation Models
Multi-Embodiment Learning

Best for: AI Scientist, Research Scientist, AI Researcher, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.