DriveCtrl: Conditioned Sim-to-Real Driving Video Generation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision) · Depth: Expert, quick

Summary

DriveCtrl is a depth-conditioned, controllable sim-to-real video generation framework that synthesizes realistic driving videos for autonomous driving systems. It targets the domain gap between simulated and real-world driving data, which typically limits how useful large-scale, fully annotated simulation data is in practice. Built on a pretrained video foundation model, DriveCtrl adds a structure-aware adapter that uses depth guidance to preserve the scene layout and motion patterns of the source simulation while keeping the output temporally coherent. A scalable data generation pipeline restyles simulator videos to match a target real-world dataset, with three conditioning signals: structural depth, the reference dataset's style, and text prompts. DriveCtrl preserves frame-level annotations for downstream perception tasks and introduces a new evaluation metric, the Driving Video Realism Score (DVRS), to assess video realism. Experiments show that DriveCtrl outperforms existing methods in realism, temporal quality, and perception task performance.
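The annotation-preservation idea can be illustrated with a minimal sketch. It assumes the generator returns exactly one output frame per simulator frame and keeps the scene layout, so simulator labels can be reattached by frame index. The `SimFrame` and `LabeledFrame` schemas, `relabel_generated_clip`, and `fake_generator` are hypothetical names for illustration, not part of DriveCtrl.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class SimFrame:
    """One simulator frame with its ground-truth labels (hypothetical schema)."""
    index: int
    pixels: bytes   # placeholder for the rendered image data
    boxes: tuple    # e.g. ((x1, y1, x2, y2, "car"), ...)


@dataclass(frozen=True)
class LabeledFrame:
    """One generated frame paired with the labels of its source frame."""
    index: int
    pixels: bytes
    boxes: tuple


def relabel_generated_clip(
    sim_frames: Sequence[SimFrame],
    generate: Callable[[Sequence[bytes]], List[bytes]],
) -> List[LabeledFrame]:
    """Restyle a clip and reattach simulator labels by frame index.

    Assumption: `generate` preserves frame count and scene layout, so the
    simulator's boxes still describe the generated frames.
    """
    generated = generate([f.pixels for f in sim_frames])
    if len(generated) != len(sim_frames):
        raise ValueError("generator must return one output frame per input frame")
    return [LabeledFrame(f.index, g, f.boxes) for f, g in zip(sim_frames, generated)]


def fake_generator(frames: Sequence[bytes]) -> List[bytes]:
    # Stand-in for a sim-to-real video model; it only tags each frame.
    return [b"real:" + p for p in frames]


clip = [
    SimFrame(0, b"f0", ((10, 20, 50, 60, "car"),)),
    SimFrame(1, b"f1", ()),
]
labeled = relabel_generated_clip(clip, fake_generator)
```

The length check is the important part of the sketch: if a generator dropped or interpolated frames, index-based label reuse would silently misalign.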

Key takeaway

For research scientists developing autonomous driving systems, DriveCtrl offers a way to narrow the sim-to-real domain gap. Consider combining depth-conditioned video generation with style transfer toward a target real-world dataset to produce more realistic training data from simulations and improve downstream perception performance. This approach could reduce the need for costly real-world data collection.

Key insights

DriveCtrl bridges the sim-to-real gap for autonomous driving by generating realistic, temporally consistent, and annotation-preserving videos.

Method

DriveCtrl uses a structure-aware adapter on a pretrained video foundation model, guided by depth, reference style, and text prompts, to transform simulated driving videos into realistic footage while preserving annotations.
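One common way to attach an adapter to a pretrained model is to freeze the backbone and add a gated residual computed from the conditioning signal. The toy sketch below shows that pattern on plain lists of floats. It is an assumption about the general technique, not DriveCtrl's confirmed architecture; `Conditioning`, `frozen_backbone`, `DepthAdapter`, and `conditioned_step` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Conditioning:
    """The three conditioning signals named in the summary (toy types)."""
    depth: Sequence[float]   # toy 1-D stand-in for a depth map
    style_reference: str     # hypothetical id of the target real-world dataset
    prompt: str              # text prompt


def frozen_backbone(latent: Sequence[float]) -> List[float]:
    """Stand-in for the pretrained video model: a fixed transform, never trained."""
    return [2.0 * v for v in latent]


class DepthAdapter:
    """Toy adapter: maps depth to a residual, scaled by a learnable gate."""

    def __init__(self, gate: float = 0.0):
        # Assumed design: a zero gate makes the adapter a no-op at the start
        # of training, so the backbone's behavior is initially unchanged.
        self.gate = gate

    def __call__(self, depth: Sequence[float]) -> List[float]:
        return [self.gate * d for d in depth]


def conditioned_step(
    latent: Sequence[float], cond: Conditioning, adapter: DepthAdapter
) -> List[float]:
    """Add the depth-derived residual to the frozen backbone's output.

    Style and prompt are carried in `cond` but not consumed by this toy.
    """
    base = frozen_backbone(latent)
    residual = adapter(cond.depth)
    if len(residual) != len(base):
        raise ValueError("depth and latent must have the same length")
    return [b + r for b, r in zip(base, residual)]


cond = Conditioning(depth=[1.0, 0.5], style_reference="target-set", prompt="rainy street")
untouched = conditioned_step([1.0, 2.0], cond, DepthAdapter(gate=0.0))
guided = conditioned_step([1.0, 2.0], cond, DepthAdapter(gate=0.5))
```

With the gate at zero the output equals the backbone's own output; a nonzero gate lets depth shift the result, which is the sense in which the adapter steers the model without retraining it.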

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.