OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation
Summary
OmniDrive introduces DRIVE-CHOREO, an LLM-choreographed multi-agent world model for multi-view driving video generation. It targets two problems in driving video generation, heterogeneous control injection and post-hoc cross-view fusion, by establishing a shared symbolic interlingua that aligns language, geometry, and pixels at the latent-token level. Three Qwen2.5-VL agents (a Director, a Cartographer, and an Auditor) collaboratively author a single position-aware token sequence. That sequence is co-compressed with the multi-view video inside a 3-D VAE, using a view-time permutation that enforces inter-camera geometry. On nuScenes, DRIVE-CHOREO reports state-of-the-art multi-view consistency, a BEV mAP of 21.6, and a competitive FVD of 45.7. A detector trained exclusively on its synthetic data gains +2.4 NDS on the real validation split, which suggests the generated data is useful for downstream tasks.
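To put the NDS figure in context, the sketch below computes the nuScenes Detection Score from its standard definition: a weighted average of mAP and five true-positive error terms, each clipped to the range 0 to 1. The function name and the input values are hypothetical and chosen only for illustration; they are not results from the paper. NDS is commonly reported on a 0 to 100 scale, so a +2.4 gain presumably corresponds to 0.024 on the 0 to 1 scale used here.

```python
from typing import Mapping

# The five true-positive error metrics used by the nuScenes detection benchmark.
TP_METRICS = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nuscenes_detection_score(mean_ap: float, tp_errors: Mapping[str, float]) -> float:
    """Compute NDS = (5 * mAP + sum(1 - min(1, mTP))) / 10 on a 0-to-1 scale."""
    if set(tp_errors) != set(TP_METRICS):
        raise ValueError(f"tp_errors must contain exactly {TP_METRICS}")
    # Each error is converted to a score; errors of 1 or more contribute zero.
    tp_scores = [1.0 - min(1.0, tp_errors[name]) for name in TP_METRICS]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical inputs for illustration only (not taken from the paper).
example_errors = {"mATE": 0.7, "mASE": 0.3, "mAOE": 0.5, "mAVE": 0.9, "mAAE": 0.2}
print(nuscenes_detection_score(0.30, example_errors))
```

Because mAP carries half of the total weight, a change in NDS can come from better detection rates, from lower localization, scale, orientation, velocity, or attribute errors, or from both.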
Key takeaway
For machine learning engineers building autonomous driving simulations, DRIVE-CHOREO offers a way to generate multi-view driving videos that stay consistent across cameras. Its LLM-choreographed multi-agent architecture is worth evaluating if you need to unify diverse control inputs and improve 3-D geometric accuracy in synthetic datasets. The reported +2.4 NDS gain, obtained by a detector trained only on synthetic data and evaluated on the real nuScenes validation split, suggests that such data can improve downstream detector performance.
Key insights
A multi-agent LLM world model unifies diverse driving controls and multi-view geometry through a shared latent token interlingua for video generation.
Principles
- Shared symbolic interlingua aligns language, geometry, pixels.
- Latent choreography enables controllable multi-view video generation.
- Multi-agent LLM collaboration enhances world model coherence.
Method
DRIVE-CHOREO uses three Qwen2.5-VL agents (Director, Cartographer, Auditor) to author a position-aware token sequence. This sequence is co-compressed with multi-view video via a view-time permutation and 3-D VAE.
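The summary does not specify the exact form of the view-time permutation, so the following is only a minimal sketch of one plausible reading: frames stored view-major as `(V, T, C, H, W)` are reordered into a single time-major sequence in which all camera views of the same timestep sit next to each other, so a 3-D encoder sliding along the sequence axis would see neighbouring cameras together. The function names, the tensor layout, and the NumPy implementation are assumptions, not the authors' code.

```python
import numpy as np


def view_time_permute(frames: np.ndarray) -> np.ndarray:
    """Reorder (V, T, C, H, W) frames into a (T*V, C, H, W) sequence.

    Assumed layout: output index t * V + v holds view v at timestep t,
    so views of the same timestep are adjacent along the sequence axis.
    """
    v, t, c, h, w = frames.shape
    return frames.transpose(1, 0, 2, 3, 4).reshape(t * v, c, h, w)


def inverse_view_time_permute(sequence: np.ndarray, num_views: int) -> np.ndarray:
    """Undo view_time_permute, recovering the (V, T, C, H, W) layout."""
    tv, c, h, w = sequence.shape
    if tv % num_views != 0:
        raise ValueError("sequence length must be a multiple of num_views")
    t = tv // num_views
    return sequence.reshape(t, num_views, c, h, w).transpose(1, 0, 2, 3, 4)


# Toy example: 6 cameras (as in nuScenes), 4 timesteps, tiny 3x8x8 frames.
frames = np.random.rand(6, 4, 3, 8, 8).astype(np.float32)
sequence = view_time_permute(frames)

# View 2 at timestep 1 lands at sequence index 1 * 6 + 2.
assert np.array_equal(sequence[1 * 6 + 2], frames[2, 1])
# The permutation is lossless and can be inverted after decoding.
assert np.array_equal(inverse_view_time_permute(sequence, num_views=6), frames)
```

A pure reordering like this adds no parameters; in this reading, any geometric coupling between cameras would come from the 3-D VAE compressing across the interleaved axis rather than from the permutation itself.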
In practice
- Generate synthetic driving data for detector training.
- Improve multi-view consistency in generated scenes.
- Integrate LLM agents for complex scene choreography.
Topics
- Multi-Agent Systems
- World Models
- Autonomous Driving
- Video Generation
- Latent Co-Compression
- LLM Choreography
- nuScenes
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.