RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

RAD-2 is a generator-discriminator framework for motion planning in high-level autonomous driving. It targets two weaknesses common in diffusion-based planners: stochastic instability and a lack of negative feedback. A diffusion-based generator produces diverse trajectory candidates, and an RL-optimized discriminator reranks them by long-term driving quality. Because this decoupled design avoids applying sparse scalar rewards directly to the high-dimensional trajectory space, optimization is more stable. Two further components support training: Temporally Consistent Group Relative Policy Optimization, which mitigates the credit assignment problem, and On-policy Generator Optimization, which uses closed-loop feedback to steer the generator toward high-reward trajectories. For efficient large-scale training, RAD-2 uses BEV-Warp, a high-throughput simulation environment that runs closed-loop evaluation in Bird's-Eye View (BEV) feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with other diffusion-based planners and shows improved safety and smoothness in real-world urban traffic.
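The generate-then-rerank loop described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not RAD-2's implementation: sample_candidates replaces the diffusion generator with random straight-line trajectories, and score_trajectory replaces the learned discriminator with a hand-written score (forward progress minus a collision penalty). Only the control flow, sampling many candidates and selecting the highest-scoring one, mirrors the summary.

```python
import random


def sample_candidates(num_candidates, horizon, rng):
    """Hypothetical stand-in for the diffusion generator.

    Returns a list of trajectories, each a list of (x, y) waypoints in metres.
    """
    candidates = []
    for _ in range(num_candidates):
        lateral_drift = rng.uniform(-0.5, 0.5)  # sideways motion per step
        speed = rng.uniform(2.0, 8.0)           # forward motion per step
        candidates.append(
            [(speed * t, lateral_drift * t) for t in range(1, horizon + 1)]
        )
    return candidates


def score_trajectory(trajectory, obstacle, safe_radius=2.0):
    """Hypothetical stand-in for the learned discriminator.

    Rewards forward progress and heavily penalises passing near the obstacle.
    """
    progress = trajectory[-1][0]
    min_dist = min(
        ((x - obstacle[0]) ** 2 + (y - obstacle[1]) ** 2) ** 0.5
        for x, y in trajectory
    )
    penalty = 100.0 if min_dist < safe_radius else 0.0
    return progress - penalty


def plan(num_candidates=16, horizon=6, obstacle=(20.0, 0.0), seed=0):
    """Sample candidates, score each one, and return the top-ranked trajectory."""
    rng = random.Random(seed)
    candidates = sample_candidates(num_candidates, horizon, rng)
    scores = [score_trajectory(c, obstacle) for c in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index], scores


if __name__ == "__main__":
    best, best_score, all_scores = plan()
    print("selected score:", best_score)
    print("selected trajectory:", best)
```

In this sketch the generator never sees the scalar score; only the ranking step does. That is a loose analogue of the decoupling the summary credits for improved optimization stability.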

Key takeaway

For engineers building autonomous driving motion planners, RAD-2's generator-discriminator framework offers a way to handle multimodal uncertainty while improving closed-loop performance. Consider a decoupled design with an RL-optimized discriminator to improve optimization stability and reduce collision rates, and consider techniques like BEV-Warp for efficient large-scale simulation and training.
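To make the idea of spatial warping in BEV space more concrete, here is a minimal sketch that resamples a small BEV grid into a new ego frame after the vehicle moves. It rests on assumptions that are not in the source: a rigid 2D transform (translation plus yaw), nearest-neighbour sampling, and a square grid centred on the ego vehicle. The function name warp_bev and all of its parameters are hypothetical; BEV-Warp's actual warping procedure is not described in this summary.

```python
import math


def warp_bev(grid, dx, dy, yaw, cell_size=1.0, fill=0.0):
    """Resample a square, ego-centred BEV grid into a new ego frame.

    The new ego pose is displaced by (dx, dy) metres and rotated by yaw
    radians relative to the old ego frame. x runs along columns, y along rows.
    Cells that fall outside the old grid are set to `fill`.
    """
    n = len(grid)
    centre = (n - 1) / 2.0
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    out = [[fill] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            # Cell centre expressed in the new ego frame (metres).
            x_new = (col - centre) * cell_size
            y_new = (row - centre) * cell_size
            # The same point in the old frame: rotate, then translate.
            x_old = cos_y * x_new - sin_y * y_new + dx
            y_old = sin_y * x_new + cos_y * y_new + dy
            # Nearest-neighbour lookup in the old grid.
            src_col = round(x_old / cell_size + centre)
            src_row = round(y_old / cell_size + centre)
            if 0 <= src_row < n and 0 <= src_col < n:
                out[row][col] = grid[src_row][src_col]
    return out


if __name__ == "__main__":
    bev = [[0.0] * 5 for _ in range(5)]
    bev[2][3] = 1.0  # a feature one cell ahead of the ego vehicle
    moved = warp_bev(bev, dx=1.0, dy=0.0, yaw=0.0)
    for r in moved:
        print(r)
```

The appeal of this style of simulation is that a new observation is obtained by resampling existing features rather than rendering and re-encoding sensor data, which is consistent with the high-throughput claim, though the real system presumably uses a more careful interpolation than nearest-neighbour.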

Key insights

RAD-2 combines a diffusion generator with an RL discriminator for robust, multimodal autonomous driving planning.

Principles

Method

RAD-2 uses a diffusion generator for trajectory candidates and an RL discriminator for reranking. It applies Temporally Consistent Group Relative Policy Optimization and On-policy Generator Optimization within a BEV-Warp simulation environment.
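As background for the Group Relative Policy Optimization part of the method, the sketch below shows the generic group-relative advantage: each rollout's reward is normalised against the mean and standard deviation of its own group, so no separate value network is needed. This is the standard GRPO-style normalisation, not RAD-2's temporally consistent variant, whose details are not given in this summary. The function name, the use of the population standard deviation, and the eps value are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its group's mean and standard deviation.

    `rewards` holds one scalar reward per rollout in the same group
    (for example, several closed-loop rollouts from the same scenario).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four rollouts: two succeeded, two collided.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that beat their group average receive positive advantages and the rest receive negative ones. When every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.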

In practice

Topics

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.