Human-like autonomy emerges from self-play and a pinch of human data
Summary
A method called "spiced self-play" trains human-like autonomous driving policies from a minimal amount of human demonstration data. It combines self-play reinforcement learning with a regularization objective derived from human data, and needs only 30 minutes of human demonstrations, about 2,500 times less than state-of-the-art imitation learning baselines. The resulting policies coordinate better with human trajectories, reaching an at-fault collision rate of 0.6-0.7%, roughly 2.5x lower than the 1.6% of SMART-tiny-CLSFT on the full Waymo dataset. Training takes 15 hours on a single consumer-grade GPU and uses 20 billion simulated transitions (approximately 63 years of driving experience). The technique avoids labor-intensive reward engineering and domain randomization, and yields policies with lower collision severity and more social driving behavior.
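The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes a 10 Hz simulation step (0.1 s per transition), which the summary does not state; under that assumption, 20 billion transitions come out at roughly 63 years.

```python
# ASSUMPTION: one simulated transition covers 0.1 s (10 Hz). Not stated in the summary.
STEP_SECONDS = 0.1
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# 20 billion transitions expressed as years of driving experience.
transitions = 20e9
sim_years = transitions * STEP_SECONDS / SECONDS_PER_YEAR

# 30 minutes of demonstrations is "2500 times less" than the baselines,
# which implies the baselines use about this many hours of human data.
human_minutes = 30
baseline_hours = human_minutes * 2500 / 60

# Ratio of the baseline collision rate (1.6%) to the reported range (0.6-0.7%).
ratio_low = 1.6 / 0.7
ratio_high = 1.6 / 0.6

print(f"simulated experience: {sim_years:.1f} years")
print(f"implied baseline data: {baseline_hours:.0f} hours")
print(f"collision-rate ratio: {ratio_low:.2f}x to {ratio_high:.2f}x")
```

The quoted 2.5x sits inside the range obtained from the two ends of the reported collision rate.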
Key takeaway
Autonomous driving engineers who want human-compatible policies without expensive human data collection should consider "spiced self-play." The method reaches robust, human-like driving behavior with as little as 30 minutes of human demonstrations, far less data than traditional imitation learning requires. Most of the learning signal comes from simulated experience, so training finishes in about 15 hours on a single consumer-grade GPU.
Key insights
Minimal human data effectively regularizes large-scale self-play RL for human-compatible autonomous driving policies.
Principles
- Pure self-play can lead to effective but non-human-like behaviors.
- Human demonstrations serve as a lightweight behavioral anchor.
- Diverse training scenarios are essential for policy generalization.
Method
Train a behavioral cloning anchor on human data, then apply it as a KL regularization penalty during Proximal Policy Optimization (PPO) self-play.
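A minimal sketch of how a KL penalty toward a frozen anchor can be added to the clipped PPO objective, for a discrete action space. The function names, the `kl_coef` and `clip_eps` values, and the direction of the KL term (current policy relative to the anchor) are assumptions made for illustration; the summary does not specify them.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(logp, logq):
    """KL(p || q) for two categorical distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))

def kl_regularized_ppo_loss(logits, old_logits, anchor_logits, actions,
                            advantages, clip_eps=0.2, kl_coef=0.1):
    """Clipped PPO surrogate plus a KL penalty toward a frozen anchor policy.

    Each of the first three arguments is a list of per-state logit vectors.
    kl_coef and clip_eps are hypothetical values, not taken from the paper.
    Returns (loss, mean_kl).
    """
    surrogate_sum = 0.0
    kl_sum = 0.0
    for lg, old_lg, anc_lg, a, adv in zip(logits, old_logits, anchor_logits,
                                          actions, advantages):
        logp = log_softmax(lg)
        logp_old = log_softmax(old_lg)
        logp_anchor = log_softmax(anc_lg)
        # Probability ratio between the current and the data-collecting policy.
        ratio = math.exp(logp[a] - logp_old[a])
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate_sum += min(ratio * adv, clipped * adv)
        # Penalty for drifting away from the human-derived anchor.
        kl_sum += kl_divergence(logp, logp_anchor)
    n = len(actions)
    mean_kl = kl_sum / n
    loss = -(surrogate_sum / n) + kl_coef * mean_kl
    return loss, mean_kl
```

When the current policy equals the anchor, the penalty vanishes and the loss reduces to the ordinary PPO surrogate; the further self-play pulls the policy away from the anchor, the larger the added cost.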
In practice
- Use 30 minutes of human driving data for behavioral anchoring.
- Train on 50,000 diverse scenarios for robust generalization.
- Implement KL regularization to guide self-play policies.
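The behavioral anchor in the first bullet can be illustrated with a toy tabular example: with discrete states and actions, behavioral cloning by maximum likelihood reduces to smoothed action frequencies. This is only a sketch under that assumption. The summary does not describe the anchor's architecture, and `fit_bc_anchor`, its `smoothing` parameter, and the state labels are hypothetical.

```python
from collections import Counter, defaultdict

def fit_bc_anchor(demonstrations, num_actions, smoothing=1.0):
    """Fit a tabular behavioral-cloning policy from (state, action) pairs.

    Returns a function mapping a state to a list of action probabilities.
    Additive smoothing keeps every probability positive, so a KL penalty
    toward this anchor stays finite even for unseen states.
    """
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1

    def anchor(state):
        c = counts.get(state, Counter())
        total = sum(c.values()) + smoothing * num_actions
        return [(c[a] + smoothing) / total for a in range(num_actions)]

    return anchor

# Hypothetical demonstrations: state labels and action indices are made up.
demos = [("approach_junction", 0), ("approach_junction", 0), ("approach_junction", 1)]
anchor = fit_bc_anchor(demos, num_actions=3)
probs_seen = anchor("approach_junction")
probs_unseen = anchor("never_visited")
```

For states absent from the demonstrations the anchor falls back to a uniform distribution, so it constrains self-play only where human data exists.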
Topics
- Self-play Reinforcement Learning
- Autonomous Driving
- Human-AI Coordination
- Behavioral Cloning
- Waymo Open Motion Dataset
- PufferDrive Simulator
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.