OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

OmniDrive introduces DRIVE-CHOREO, an LLM-choreographed multi-agent world model for multi-view driving video generation. It addresses two challenges in driving video generation, heterogeneous control injection and post-hoc cross-view fusion, by establishing a shared symbolic interlingua that aligns language, geometry, and pixels at the latent-token level. Three Qwen2.5-VL agents (a Director, a Cartographer, and an Auditor) collaboratively author a single position-aware token sequence. This sequence is then co-compressed with the multi-view video inside a 3-D VAE, using a view-time permutation that enforces inter-camera geometry. On nuScenes, the authors report state-of-the-art multi-view consistency, a BEV mAP of 21.6, and a competitive FVD of 45.7. A detector trained exclusively on the model's synthetic data also gained +2.4 NDS on the real validation split, which supports the data's usefulness for downstream tasks.

Key takeaway

For Machine Learning Engineers building autonomous driving simulations, DRIVE-CHOREO is an approach to generating multi-view driving videos with strong cross-view consistency. Consider its LLM-choreographed multi-agent architecture if you need to unify diverse control inputs and improve 3-D geometric accuracy in synthetic datasets. The reported +2.4 NDS gain on the real validation split, from a detector trained only on synthetic data, suggests that such data can improve downstream detection on real-world scenes.

Key insights

A multi-agent LLM world model unifies diverse driving controls and multi-view geometry through a shared latent token interlingua for video generation.

Principles

Method

DRIVE-CHOREO uses three Qwen2.5-VL agents (Director, Cartographer, Auditor) to author a single position-aware token sequence. This sequence is co-compressed with the multi-view video inside a 3-D VAE, using a view-time permutation that enforces inter-camera geometry.
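The summary does not spell out the exact view-time permutation, so the following is only a minimal sketch of one plausible reading: the per-camera clips are interleaved into a single time-major sequence, so that frames from different cameras at the same timestamp sit next to each other before a 3-D (spatio-temporal) encoder sees them. The function names, the tensor layout (views, time, channels, height, width), and the interleaving order are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def view_time_permute(video: np.ndarray) -> np.ndarray:
    """Interleave camera views along the time axis (hypothetical layout).

    Input shape:  (V, T, C, H, W)  -- V cameras, T frames each.
    Output shape: (T * V, C, H, W) -- time-major, so the V views of one
    timestamp are contiguous in the sequence.
    """
    v, t, c, h, w = video.shape
    # Swap the view and time axes, then flatten them into one sequence axis.
    return video.transpose(1, 0, 2, 3, 4).reshape(t * v, c, h, w)


def view_time_unpermute(seq: np.ndarray, num_views: int) -> np.ndarray:
    """Invert view_time_permute, recovering the (V, T, C, H, W) layout."""
    tv, c, h, w = seq.shape
    t = tv // num_views
    return seq.reshape(t, num_views, c, h, w).transpose(1, 0, 2, 3, 4)


# Toy input: 6 cameras (the nuScenes camera count), 4 frames, tiny 3x8x8 images.
video = np.arange(6 * 4 * 3 * 8 * 8, dtype=np.float32).reshape(6, 4, 3, 8, 8)
seq = view_time_permute(video)
restored = view_time_unpermute(seq, num_views=6)
```

In this layout, `seq[0]` through `seq[5]` hold the six camera frames for the first timestamp, and `seq[6]` starts the second timestamp. A 3-D convolution over the sequence axis would therefore mix neighbouring cameras as well as neighbouring times, which is one way a single encoder could be exposed to inter-camera structure; the paper's actual ordering may differ.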

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.