Robbyant Open Sources LingBot World: a Real Time World Model for Interactive Simulation and Embodied AI
Summary
Robbyant, from Ant Group, has open-sourced LingBot World, an action-conditioned world model for real-time, interactive video simulation in embodied AI, driving, and gaming. The model turns text and control inputs into long-horizon simulations. It is built on a 28B-parameter mixture-of-experts diffusion transformer initialized from Wan2.2, and it learns dynamics from a unified data engine that integrates web videos, game logs with actions, and Unreal Engine trajectories. LingBot World uses hierarchical captions to separate static layout from motion, and it incorporates actions through camera embeddings and adaptive keyboard adapters. A distilled version, LingBot World Fast, reaches approximately 16 frames per second at 480p on a single GPU node with under 1 second of latency. The release is also reported to show strong emergent memory and structural consistency and to achieve leading VBench scores.
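To put the reported throughput in perspective, the arithmetic below converts the approximate 16 frames per second into a per-frame time budget and counts how many frames fit into a one-second window. This is a back-of-the-envelope sketch: the function names are illustrative, and reading "latency" as the end-to-end response window is an assumption, since the summary does not define it.

```python
# Back-of-the-envelope frame budget for the figures quoted in the summary.
TARGET_FPS = 16        # approximate throughput reported for LingBot World Fast
LATENCY_BOUND_S = 1.0  # reported upper bound on latency (interpretation assumed)


def frame_budget_ms(fps: float) -> float:
    """Average time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def frames_within(window_s: float, fps: float) -> int:
    """Number of frames produced in a time window at a steady frame rate."""
    return int(window_s * fps)


print(frame_budget_ms(TARGET_FPS))                 # 62.5
print(frames_within(LATENCY_BOUND_S, TARGET_FPS))  # 16
```

At this rate, each frame has roughly 62.5 ms of compute on average, which is the budget any interactive controller would need to work within.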
Key takeaway
For AI scientists building embodied agents or interactive simulations, LingBot World offers an open-source option for learning long-horizon dynamics. Its architecture combines a large diffusion transformer with hierarchical captions and action conditioning, a significant advance over frame-to-frame reactive models. Consider integrating LingBot World into your simulation environments to improve agent planning stability and obtain more consistent, memory-aware behavior.
Key insights
LingBot World enables long-horizon, interactive video simulation for embodied AI using a 28B-parameter diffusion transformer.
Principles
- Unified data engines improve dynamic learning.
- Hierarchical captions enhance environmental understanding.
- Action conditioning is crucial for interactive agents.
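As a toy illustration of the action-conditioning principle, the sketch below rolls a state forward under a sequence of keyboard inputs, carrying the state from one step to the next. It is not LingBot World's interface: the real model generates video frames with a diffusion transformer, whereas here the "state" is a hypothetical 2D camera position, and `KEY_TO_DELTA`, `step`, and `rollout` are names invented for this example.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a world model's state: a 2D camera position (assumption).
State = Tuple[int, int]

# Hypothetical mapping from keyboard actions to position changes.
KEY_TO_DELTA: Dict[str, Tuple[int, int]] = {
    "W": (0, 1),   # forward
    "S": (0, -1),  # backward
    "A": (-1, 0),  # left
    "D": (1, 0),   # right
}


def step(state: State, key: str) -> State:
    """Advance the state by one action; unknown keys leave it unchanged."""
    dx, dy = KEY_TO_DELTA.get(key, (0, 0))
    return (state[0] + dx, state[1] + dy)


def rollout(start: State, keys: List[str]) -> List[State]:
    """Apply actions in order, feeding each resulting state into the next step."""
    states = [start]
    for key in keys:
        states.append(step(states[-1], key))
    return states


trajectory = rollout((0, 0), ["W", "W", "D", "S", "A"])
print(trajectory[-1])  # (0, 1)
```

The point of the loop is that every step depends on both the previous state and the current action, which is what lets a simulation stay consistent over a long horizon instead of reacting frame by frame.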
Method
LingBot World uses a 28B-parameter diffusion transformer initialized from Wan2.2. It is trained on data from a unified data engine that combines web videos, game logs, and Unreal Engine trajectories, annotated with hierarchical captions.
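One way to picture a training record with hierarchical captions is a clip that carries a static scene description, time-bounded motion descriptions, and optional actions. The sketch below is a guess at such a structure, not the project's actual schema: `ClipSample`, `MotionSegment`, `conditioning_text`, the field names, and the source labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MotionSegment:
    """A motion caption that applies to one time span of the clip."""
    start_s: float
    end_s: float
    caption: str


@dataclass
class ClipSample:
    """Hypothetical training record; field names are illustrative only."""
    source: str                    # e.g. "web_video", "game_log", "unreal"
    scene_caption: str             # static layout, valid for the whole clip
    motion: List[MotionSegment] = field(default_factory=list)
    actions: Optional[List[str]] = None  # controls, absent for plain web video


def conditioning_text(sample: ClipSample, t: float) -> str:
    """Join the scene caption with any motion caption active at time t."""
    active = [m.caption for m in sample.motion if m.start_s <= t < m.end_s]
    return " ".join([sample.scene_caption] + active)


sample = ClipSample(
    source="game_log",
    scene_caption="A stone courtyard with a fountain.",
    motion=[
        MotionSegment(0.0, 2.0, "The camera moves forward."),
        MotionSegment(2.0, 4.0, "The camera pans left."),
    ],
    actions=["W", "W", "A", "A"],
)
print(conditioning_text(sample, 2.5))
# A stone courtyard with a fountain. The camera pans left.
```

Keeping the scene caption separate from the motion segments mirrors the stated goal of distinguishing static layout from motion: the layout text stays fixed while the motion text changes over time.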
In practice
- Utilize LingBot World for embodied agent training.
- Apply hierarchical captions for scene understanding.
- Explore distilled variants for faster inference.
Topics
- LingBot World
- World Models
- Embodied AI
- Diffusion Transformers
- Interactive Simulation
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.