World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

World-Language-Action (WLA) models are a new class of embodied foundation models that unify world modeling, language reasoning, and action synthesis. They take textual instructions, images, and robot states as input and jointly predict textual subtasks, subgoal images, and robot actions. Unlike previous World-Action Models (WAMs), which use bidirectional diffusion Transformers, WLA uses an autoregressive (AR) Transformer backbone to predict a "next state" that comprises both semantic-level textual intention and fine-grained physical dynamics. The WLA-0 prototype, with 2B active parameters, runs at 40 ms inference latency on an NVIDIA RTX 5090. In evaluations, WLA-0 reaches state-of-the-art results in multi-task and long-horizon learning, including a 92.94% success rate on RoboTwin2.0 Clean and 56.5% on RMBench. It can also learn novel tasks directly from cross-embodiment robot videos without action annotations.
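
To put the 40 ms latency figure in context, the arithmetic below converts per-call latency into an upper bound on inference rate. It is a back-of-the-envelope sketch only: it assumes inference calls run strictly one after another, and the `chunk_size` parameter (actions emitted per call) is a hypothetical illustration, not a value reported for WLA-0.

```python
def max_inference_rate_hz(latency_ms: float) -> float:
    """Upper bound on sequential inference calls per second."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def actions_per_second(latency_ms: float, chunk_size: int) -> float:
    """Actions available per second if each call emits `chunk_size` actions.

    `chunk_size` is a hypothetical parameter used for illustration only.
    """
    return max_inference_rate_hz(latency_ms) * chunk_size


if __name__ == "__main__":
    # 40 ms per call allows at most 25 sequential calls per second.
    print(max_inference_rate_hz(40.0))
```

Under these assumptions, a 40 ms call caps the model at 25 sequential calls per second; whether that is enough for a given controller depends on how many actions each call produces and on the robot's control frequency.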

Key takeaway

For robotics engineers developing embodied AI, WLA models offer a compelling architecture for real-time control and complex task execution. You should consider WLA's autoregressive design for its efficiency and ability to handle long-horizon tasks through language-based planning and memory. Its capacity to learn from action-free, cross-embodiment videos could significantly reduce your data collection burden for novel skills.

Key insights

WLA models unify world modeling, language reasoning, and action synthesis for robust embodied AI.

Method

WLA uses an AR Transformer backbone, a World Expert for future visual state prediction (via VAE features), and an Action Expert for action generation, trained end-to-end with meta-queries.
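
The data flow described above can be sketched as a toy pipeline. This is not the authors' implementation: every class and function name here (`Observation`, `NextState`, `backbone`, `world_expert`, `action_expert`, `predict_next_state`) is hypothetical, the "networks" are deterministic stand-ins built from plain Python lists, and the sketch assumes that the backbone's hidden states at the meta-query positions are what condition the two experts.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    instruction: str          # textual instruction
    image: List[float]        # stand-in for encoded image features
    robot_state: List[float]  # e.g. joint positions


@dataclass
class NextState:
    subtask: str                 # semantic-level textual intention
    subgoal_latent: List[float]  # stand-in for VAE features of the subgoal image
    actions: List[List[float]]   # predicted robot actions


def backbone(obs: Observation, n_meta_queries: int = 4,
             dim: int = 8) -> Tuple[str, List[List[float]]]:
    """Toy stand-in for the AR Transformer backbone.

    Returns a textual subtask plus one hidden vector per meta-query
    (assumed here to be the conditioning signal for the experts).
    """
    rng = random.Random(len(obs.instruction))  # deterministic toy "weights"
    subtask = "subtask for: " + obs.instruction
    hidden = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
              for _ in range(n_meta_queries)]
    return subtask, hidden


def _pool(hidden: List[List[float]]) -> List[float]:
    """Mean-pool the meta-query hidden vectors into a single vector."""
    return [sum(col) / len(hidden) for col in zip(*hidden)]


def world_expert(hidden: List[List[float]], latent_dim: int = 16) -> List[float]:
    """Toy World Expert: maps pooled hidden states to a subgoal latent."""
    pooled = _pool(hidden)
    return [pooled[i % len(pooled)] for i in range(latent_dim)]


def action_expert(hidden: List[List[float]], robot_state: List[float],
                  horizon: int = 5) -> List[List[float]]:
    """Toy Action Expert: emits `horizon` actions near the current state."""
    pooled = _pool(hidden)
    offset = sum(pooled) / len(pooled)
    return [[s + offset * (t + 1) for s in robot_state] for t in range(horizon)]


def predict_next_state(obs: Observation) -> NextState:
    """One forward pass: backbone first, then both experts on its output."""
    subtask, hidden = backbone(obs)
    return NextState(
        subtask=subtask,
        subgoal_latent=world_expert(hidden),
        actions=action_expert(hidden, obs.robot_state),
    )
```

The point of the sketch is the wiring rather than the math: a single backbone pass yields the textual subtask and the shared conditioning, and both experts read from that same conditioning. That shared path is one plausible reading of "trained end-to-end", since gradients from both heads would reach the backbone.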

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.