World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

World-Language-Action (WLA) models are introduced as a new class of embodied foundation models that unify world modeling, language reasoning, and action synthesis. They take textual instructions, images, and robot states as input and jointly predict textual subtasks, subgoal images, and robot actions. Unlike previous World-Action Models (WAMs), which use bidirectional diffusion Transformers, WLA uses an autoregressive (AR) Transformer backbone to predict a "next state" comprising both semantic-level textual intention and fine-grained physical dynamics. The WLA-0 prototype, with 2B active parameters, runs at 40 ms inference latency on an NVIDIA RTX 5090. In the reported evaluations, WLA-0 achieves state-of-the-art results in multi-task and long-horizon learning, including a 92.94% success rate on RoboTwin2.0 Clean and 56.5% on RMBench. It can also learn novel tasks directly from cross-embodiment robot videos without action annotations.

Key takeaway

For robotics engineers developing embodied AI, WLA models offer a compelling architecture for real-time control and complex task execution. You should consider WLA's autoregressive design for its efficiency and ability to handle long-horizon tasks through language-based planning and memory. Its capacity to learn from action-free, cross-embodiment videos could significantly reduce your data collection burden for novel skills.

Key insights

WLA models unify world modeling, language reasoning, and action synthesis for robust embodied AI.

Principles

Next state prediction should combine high-level textual intention and low-level physical dynamics.
Autoregressive Transformers can unify language generation and physical dynamics modeling.
World prediction acts on the model through implicit parameter updates during training, so it can be disabled at inference for efficiency.
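
To make these principles concrete, here is a minimal Python sketch of a "next state" that carries a textual subtask, a robot action, and an optional subgoal. Everything in it is an assumption for illustration: `NextState`, `predict_next_state`, and the `predict_world` flag are hypothetical names, and the arithmetic is placeholder logic rather than anything from WLA-0.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NextState:
    """One predicted step: semantic intention plus physical dynamics."""
    subtask: str                           # high-level textual intention
    action: List[float]                    # low-level robot action
    subgoal: Optional[List[float]] = None  # future visual features; None if skipped


def predict_next_state(instruction: str,
                       image_feats: List[float],
                       robot_state: List[float],
                       predict_world: bool = True) -> NextState:
    """Toy stand-in for a WLA-style forward pass (placeholder arithmetic only)."""
    subtask = f"next subtask for: {instruction}"
    # Placeholder "policy": nudge each state dimension toward zero.
    action = [-0.1 * s for s in robot_state]
    # World prediction is an optional head here, so skipping it leaves the
    # subtask and action outputs unchanged.
    subgoal = [0.5 * f for f in image_feats] if predict_world else None
    return NextState(subtask=subtask, action=action, subgoal=subgoal)


with_world = predict_next_state("stack the blocks", [0.2, 0.4], [1.0, -2.0])
without_world = predict_next_state("stack the blocks", [0.2, 0.4], [1.0, -2.0],
                                   predict_world=False)
```

In this sketch, `without_world` has the same subtask and action as `with_world` but no subgoal, which is the property that would let a deployment skip world prediction to save compute.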

Method

WLA uses an AR Transformer backbone, a World Expert for future visual state prediction (via VAE features), and an Action Expert for action generation, trained end-to-end with meta-queries.
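
The data flow can be sketched roughly as follows. The sketch assumes that meta-queries are extra query tokens appended after the context, whose hidden states condition the two experts. This summary does not spell that out, so the token ordering, the function names, and the running-mean "backbone" are all hypothetical stand-ins.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def causal_backbone(embeddings: List[Vector]) -> List[Vector]:
    """Stand-in for an AR Transformer: position i sees only positions 0..i.

    Each output is the running mean of the inputs so far, which is causal
    but otherwise nothing like real attention.
    """
    hidden: List[Vector] = []
    running = [0.0] * len(embeddings[0])
    for i, emb in enumerate(embeddings, start=1):
        running = [r + e for r, e in zip(running, emb)]
        hidden.append([r / i for r in running])
    return hidden


def run_with_meta_queries(
    context: List[Vector],
    action_queries: List[Vector],
    world_queries: Optional[List[Vector]] = None,
) -> Tuple[List[Vector], List[Vector]]:
    """Append query tokens after the context and split their hidden states."""
    world_queries = world_queries or []
    hidden = causal_backbone(context + action_queries + world_queries)
    n_ctx, n_act = len(context), len(action_queries)
    action_cond = hidden[n_ctx:n_ctx + n_act]  # would condition the Action Expert
    world_cond = hidden[n_ctx + n_act:]        # would condition the World Expert
    return action_cond, world_cond


context = [[1.0, 0.0], [0.0, 1.0]]   # toy instruction/image/state embeddings
action_q = [[0.0, 0.0]]
world_q = [[0.0, 0.0]]

act_full, world_full = run_with_meta_queries(context, action_q, world_q)
act_only, world_none = run_with_meta_queries(context, action_q)
```

Because the toy backbone is causal and the world queries come last, dropping them leaves `act_only` identical to `act_full`. That ordering is a design choice of this sketch, picked to show one way an optional world branch could coexist with a shared backbone.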

In practice

Use lightweight diffusion Transformers for the World Expert.
Predict static future frames, not full video clips, for physical dynamics.
Employ test-time scaling (TTS) with value models for improved control.
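
As an illustration of the last point, the sketch below uses best-of-N sampling, one common form of test-time scaling: draw several candidate actions and keep the one a value model scores highest. The summary does not say which TTS scheme WLA uses, so best-of-N is an assumption, and `best_of_n`, `noisy_policy`, and `toy_value` are hypothetical stand-ins for a real policy and value model.

```python
import random
from typing import Callable, List

Action = List[float]


def best_of_n(sample_action: Callable[[random.Random], Action],
              value_fn: Callable[[Action], float],
              n: int,
              seed: int = 0) -> Action:
    """Draw n candidate actions and return the one the value model scores highest."""
    rng = random.Random(seed)
    candidates = [sample_action(rng) for _ in range(n)]
    return max(candidates, key=value_fn)


# Hypothetical stand-ins: a noisy policy around a nominal action, and a
# value model that prefers actions close to a target.
TARGET = [0.5, -0.2]


def noisy_policy(rng: random.Random) -> Action:
    return [0.4 + rng.gauss(0.0, 0.2), -0.1 + rng.gauss(0.0, 0.2)]


def toy_value(action: Action) -> float:
    return -sum((a - t) ** 2 for a, t in zip(action, TARGET))


single = best_of_n(noisy_policy, toy_value, n=1)
scaled = best_of_n(noisy_policy, toy_value, n=16)
```

With a fixed seed, the 16 candidates include the single-sample candidate, so `scaled` never scores worse than `single` under `toy_value`. The trade-off is extra inference cost per control step, which matters when the latency budget is tight.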

Topics

Embodied AI
World Models
Language Reasoning
Action Synthesis
Robot Learning
Real-time Control

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.