Chain of World: World Model Thinking in Latent Motion
Summary
CoWVLA (Chain-of-World VLA) introduces a "Chain of World" paradigm to address two limitations of existing Vision-Language-Action (VLA) models: world-model approaches waste capacity reconstructing redundant backgrounds, while latent-action approaches lack continuous temporal dynamics. CoWVLA unifies world-model temporal reasoning with a disentangled latent motion representation, using a pretrained video VAE to factorize video segments into structure and motion latents. During pre-training, the VLA takes an instruction and an initial frame, infers a continuous latent motion chain, and predicts the terminal frame. During co-fine-tuning, a unified autoregressive decoder aligns these latent dynamics with discrete action prediction. The design aims to preserve the temporal reasoning and world knowledge of world models while keeping the compactness and interpretability of latent actions. Experiments on robotic simulation benchmarks show that CoWVLA outperforms current world-model and latent-action methods with moderate computational efficiency, which points to its potential for effective visuomotor learning.
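The two training stages described above can be pictured with a small toy sketch. This is not the paper's implementation: the temporal-mean "structure" and per-frame residual "motion" are a simple stand-in for the pretrained video VAE, the uniform action binning is an assumed discretization, and every name here (`factorize`, `pretrain_example`, `cofinetune_example`, `discretize_action`) is hypothetical.

```python
from dataclasses import dataclass
from typing import List

Frame = List[float]  # a flattened toy "frame" (assumption: real frames are images)


@dataclass
class Factorized:
    structure: Frame      # time-invariant content (toy: per-pixel temporal mean)
    motion: List[Frame]   # per-frame residuals (toy stand-in for motion latents)


def factorize(frames: List[Frame]) -> Factorized:
    """Toy split of a video segment into structure and motion parts."""
    n = len(frames)
    structure = [sum(px) / n for px in zip(*frames)]
    motion = [[p - s for p, s in zip(f, structure)] for f in frames]
    return Factorized(structure, motion)


def reconstruct(z: Factorized) -> List[Frame]:
    """Invert the toy factorization: structure plus each residual."""
    return [[s + m for s, m in zip(z.structure, res)] for res in z.motion]


def discretize_action(a: float, low: float = -1.0, high: float = 1.0,
                      bins: int = 256) -> int:
    """Assumed uniform binning of a continuous action into a token id."""
    a = min(max(a, low), high)
    return min(int((a - low) / (high - low) * bins), bins - 1)


def pretrain_example(instruction: str, frames: List[Frame]) -> dict:
    """Stage 1: from instruction + first frame, target motion chain + last frame."""
    z = factorize(frames)
    return {"inputs": (instruction, frames[0]),
            "targets": (z.motion, frames[-1])}


def cofinetune_example(instruction: str, frames: List[Frame],
                       actions: List[float]) -> dict:
    """Stage 2: same targets, plus discrete action tokens (assumed layout)."""
    ex = pretrain_example(instruction, frames)
    ex["targets"] = ex["targets"] + ([discretize_action(a) for a in actions],)
    return ex
```

The point of the sketch is only the shape of the data: the motion targets carry what changes between frames, the structure part carries what stays fixed, and the co-fine-tuning stage adds discrete action tokens to the same autoregressive target sequence.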
Key takeaway
CoWVLA (Chain-of-World VLA) unifies world-model temporal reasoning with disentangled latent motion to address limitations in current Vision-Language-Action models. It uses a pretrained video VAE to extract latent motion and a unified autoregressive decoder to align continuous latent dynamics with discrete actions, outperforming existing world-model and latent-action approaches on robotic simulation benchmarks. The results suggest it is a promising pretraining paradigm for visuomotor learning in embodied AI, at moderate computational cost.
Topics
- Vision-Language-Action Models
- World Models
- Latent Motion Representation
- Robotic Simulation
- Visuomotor Learning
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.