VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning
Summary
VeriDrive is a framework for constructing planning-oriented, verifiable counterfactual supervision for vision-language driving models. It addresses the high cost and unverifiable nature of existing reasoning supervision by converting driving reasoning into a structured Perception–Evaluation–Revision chain. The chain grounds key objects in their future motion, evaluates alternative ego trajectories against rule-checkable evidence, and revises risky intent toward expert behavior to produce the final planning targets. To scale data construction, VeriDrive combines low-cost local generation using Qwen3-VL-32B-Instruct with validator-guided selective correction, escalating only invalid or difficult samples (approximately 30%) to a higher-quality generator such as GPT-5.5. The VeriDrive dataset, built on nuScenes, contains 27,271 training and 5,868 validation samples. In experiments, VeriDrive improves open-loop planning metrics relative to OmniDrive, reducing average Collision from 0.30 to 0.2411 and Intersection from 3.00 to 2.0328. It also cuts logged token usage from 256.6M to 157.3M, generation time from 290.0 hours to 208.2 hours, and estimated paid GPT API cost from $2.69k to $0.54k.
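To put the reported figures on a common scale, the short Python sketch below computes the relative reduction for each before/after pair quoted in the summary. The dictionary keys and the helper name `relative_reduction` are illustrative labels, not names from the paper; the numbers are copied from the summary as-is.

```python
# Before/after values copied from the summary; the key names are illustrative.
reported = {
    "average Collision": (0.30, 0.2411),
    "Intersection": (3.00, 2.0328),
    "logged tokens (M)": (256.6, 157.3),
    "generation time (h)": (290.0, 208.2),
    "paid GPT API cost ($k)": (2.69, 0.54),
}


def relative_reduction(before: float, after: float) -> float:
    """Return the fraction by which `after` is lower than `before`."""
    return (before - after) / before


reductions = {name: relative_reduction(b, a) for name, (b, a) in reported.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower")
```

Under this calculation the estimated API cost falls by roughly 80%, token usage by about 39%, and generation time by about 28%, while Collision and Intersection fall by about 20% and 32% respectively.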
Key takeaway
If you are a machine learning engineer developing autonomous-driving VLMs, consider integrating verifiable counterfactual supervision to improve planning performance and reduce data-generation costs. Structuring reasoning into auditable Perception–Evaluation–Revision steps and using a budget-aware hybrid generation pipeline can improve safety-proxy metrics such as Collision and Intersection rates while cutting API expenses. Rule-based validators help ensure supervision quality and keep escalation to paid models limited to the samples that need it.
Key insights
Structured, verifiable counterfactual supervision improves open-loop planning metrics for vision-language driving models while lowering the cost of generating the supervision data.
Principles
- Decompose reasoning into auditable, rule-checkable stages.
- Ground counterfactuals in explicit scene evidence.
- Prioritize revision of intent over direct trajectory replacement.
Method
VeriDrive structures supervision as a Perception–Evaluation–Revision chain and builds the data by combining low-cost local generation with validator-guided selective correction, escalating only invalid or difficult samples to a higher-quality generator.
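The escalation logic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: `build_dataset`, `Sample`, and the three toy callables are hypothetical names, the generators are stubs rather than real model calls, and the validity check is a placeholder for whatever rules the validator applies.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    scene_id: str
    chain: str   # generated reasoning text
    source: str  # "local" or "strong"


def build_dataset(
    scene_ids: List[str],
    local_generate: Callable[[str], str],
    strong_generate: Callable[[str], str],
    is_valid: Callable[[str], bool],
) -> Tuple[List[Sample], float]:
    """Generate with the cheap model first; escalate only validator failures."""
    dataset: List[Sample] = []
    escalated = 0
    for scene_id in scene_ids:
        chain, source = local_generate(scene_id), "local"
        if not is_valid(chain):
            # Assumption: one retry with the stronger generator, kept as-is.
            chain, source = strong_generate(scene_id), "strong"
            escalated += 1
        dataset.append(Sample(scene_id, chain, source))
    rate = escalated / len(dataset) if dataset else 0.0
    return dataset, rate


# Toy stand-ins (hypothetical): the local stub fails on every third scene.
def toy_local(scene_id: str) -> str:
    return "" if int(scene_id) % 3 == 0 else f"local chain for scene {scene_id}"


def toy_strong(scene_id: str) -> str:
    return f"corrected chain for scene {scene_id}"


def toy_is_valid(chain: str) -> bool:
    return bool(chain.strip())


dataset, rate = build_dataset(
    [str(i) for i in range(1, 11)], toy_local, toy_strong, toy_is_valid
)
print(f"escalated {rate:.0%} of {len(dataset)} samples")
```

The toy run escalates 3 of 10 scenes, which loosely mirrors the approximately 30% escalation rate reported in the summary. Passing the generators and the validator as callables keeps the routing logic independent of any particular model API.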
In practice
- Implement rule-based validators for reasoning chains.
- Use hybrid generation for cost-efficient data scaling.
- Focus supervision on intent-level decision correction.
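As one way to approach the first item above, the sketch below checks a Perception–Evaluation–Revision chain against a few simple rules. The chain schema (the `perception`, `evaluation`, and `revision` keys and their fields), the function name `validate_chain`, and the four rules are all assumptions made for illustration; the source does not specify the validator's actual rules or data format.

```python
from typing import Any, Dict, List


def validate_chain(chain: Dict[str, Any], expert_intent: str) -> List[str]:
    """Return a list of rule violations; an empty list means the chain passes.

    Hypothetical schema:
      perception: [{"id": ...}, ...]
      evaluation: [{"intent": ..., "risky": bool, "evidence_object_ids": [...]}, ...]
      revision:   {"from_intent": ..., "to_intent": ...}
    """
    errors: List[str] = []
    perception = chain.get("perception") or []
    evaluation = chain.get("evaluation") or []
    revision = chain.get("revision") or {}

    # Rule 1: all three stages must be present and non-empty.
    for name, stage in (("perception", perception),
                        ("evaluation", evaluation),
                        ("revision", revision)):
        if not stage:
            errors.append(f"{name} stage is missing or empty")

    # Rule 2: evaluation evidence must refer to objects listed in perception.
    known_ids = {obj.get("id") for obj in perception}
    risky_intents = set()
    for candidate in evaluation:
        for obj_id in candidate.get("evidence_object_ids", []):
            if obj_id not in known_ids:
                errors.append(f"evaluation cites unknown object {obj_id!r}")
        if candidate.get("risky"):
            risky_intents.add(candidate.get("intent"))

    if revision:
        # Rule 3: the revision must start from an intent evaluated as risky.
        if revision.get("from_intent") not in risky_intents:
            errors.append("revision starts from an intent not evaluated as risky")
        # Rule 4: the revised intent must match the expert behavior.
        if revision.get("to_intent") != expert_intent:
            errors.append("revised intent does not match the expert intent")
    return errors


good_chain = {
    "perception": [{"id": "car_1"}],
    "evaluation": [{"intent": "accelerate", "risky": True,
                    "evidence_object_ids": ["car_1"]}],
    "revision": {"from_intent": "accelerate", "to_intent": "slow_down"},
}
bad_chain = {
    "perception": [{"id": "car_1"}],
    "evaluation": [{"intent": "accelerate", "risky": True,
                    "evidence_object_ids": ["truck_9"]}],
    "revision": {"from_intent": "accelerate", "to_intent": "turn_left"},
}

print(validate_chain(good_chain, expert_intent="slow_down"))
print(validate_chain(bad_chain, expert_intent="slow_down"))
```

A check of this kind can serve as the validity test that decides which samples are escalated to the stronger generator. Returning the violations instead of a single boolean also leaves room to log why a sample failed.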
Topics
- Vision-Language Models
- Autonomous Driving
- Counterfactual Supervision
- Planning Algorithms
- Dataset Generation
- nuScenes
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.