TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

TerraTransfer is an approach to end-to-end autonomous driving that targets the two main costs of conventional training: collecting and labeling millions of driving frames, and running closed-loop reinforcement learning directly on images. It decouples learning to drive from learning to see. A driving policy is first pretrained through self-play in vectorized simulators, which run millions of rollout steps per second and generate a rich distribution of challenging scenarios. The pretrained policy's latent space is then aligned with a vision backbone using an action KL divergence and a batch-relational low-rank structural loss. This alignment needs no curated expert demonstrations, only a dataset of paired (image, scene-state) samples. The resulting end-to-end policy matches or exceeds prior methods in photorealistic 3D Gaussian splatting closed-loop scenarios.

Key takeaway

For engineers building end-to-end driving systems, TerraTransfer reduces reliance on costly expert demonstrations and large-scale data labeling. You can speed up policy development by running self-play in vectorized simulators, which generates diverse training scenarios efficiently. Consider adopting the decoupled approach to simplify your training pipeline: the vision side needs only paired (image, scene-state) data, and the reported closed-loop performance matches or exceeds prior methods.

Key insights

TerraTransfer decouples driving policy learning from vision, using self-play in simulators to eliminate expert demonstrations.

Principles

Method

Pretrain a policy via self-play in vectorized simulators. Align its latent space with a pretrained vision backbone using action KL divergence and a batch-relational low-rank structural loss, requiring only (image, scene-state) pairs.
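The alignment step above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not give the exact form of either loss, so the code assumes a discrete action set for the KL term and stands in a plain cosine-similarity matching term for the batch-relational structural loss, leaving out the low-rank part entirely. The function names and the `weight` parameter are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw action logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def action_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one sample.

    Assumption: actions form a discrete set, so both policies output logits.
    """
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def cosine_gram(latents):
    """Pairwise cosine similarities between all latents in a batch."""
    unit = []
    for v in latents:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def relational_loss(teacher_latents, student_latents):
    """Mean squared difference between the two batch similarity matrices.

    Simplified stand-in for the structural loss: it compares how samples
    relate to each other within the batch, with no low-rank constraint.
    """
    gt = cosine_gram(teacher_latents)
    gs = cosine_gram(student_latents)
    n = len(gt)
    return sum((gt[i][j] - gs[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def alignment_loss(teacher_logits, student_logits,
                   teacher_latents, student_latents, weight=1.0):
    """Batch-mean action KL plus the weighted relational term.

    `weight` is a hypothetical balancing coefficient, not a value from the source.
    """
    kl = sum(action_kl(t, s)
             for t, s in zip(teacher_logits, student_logits)) / len(teacher_logits)
    return kl + weight * relational_loss(teacher_latents, student_latents)
```

In this sketch the teacher is the simulator-pretrained policy fed with scene state, and the student is the vision-based policy fed with the paired image. Because the relational term compares batch-by-batch similarity matrices rather than the latents themselves, the two latent spaces do not need to share a dimension.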

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.