VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

VLADriveBench is a framework for evaluating the relationship between chain-of-thought (CoT) reasoning and actions in vision-language-action (VLA) models for autonomous driving. Existing benchmarks focus primarily on trajectory quality and do not test whether the generated CoT is relevant to, consistent with, or causally connected to the driving actions. VLADriveBench addresses this gap by combining observational metrics (mentioning, hallucination, contradiction, and action alignment) with a CoT intervention protocol. Applied to three VLA models across two architectures, the framework shows that observational alignment and causal influence can diverge. ORION achieved the highest observational alignment scores, yet its CoT was found to be epiphenomenal: it accompanies the actions without driving them. Alpamayo v1.5 scored lower on the observational metrics but showed a strongly causal CoT, with visual salience gating its influence.

Key takeaway

For machine learning engineers developing or evaluating vision-language-action models for autonomous driving, trajectory quality metrics alone are insufficient. Explicitly assess the causal relationship between the model's chain-of-thought reasoning and its driving actions. Integrate VLADriveBench's observational metrics and CoT intervention protocol into your evaluation pipeline to check that the CoT genuinely influences behavior, and account for the finding that visual salience can gate this influence.
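A minimal sketch of what an intervention check could look like in an evaluation pipeline, assuming the model can be re-run with a replacement CoT. The `Policy` interface, the two toy policies, and the average-displacement measure are hypothetical illustrations and are not taken from VLADriveBench's actual protocol.

```python
import math
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints
# Hypothetical interface: (scene description, CoT text) -> planned trajectory
Policy = Callable[[str, str], Trajectory]


def average_displacement(a: Trajectory, b: Trajectory) -> float:
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if not a or len(a) != len(b):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def cot_intervention_effect(policy: Policy, scene: str,
                            original_cot: str, edited_cot: str) -> float:
    """Re-run the policy with an edited CoT and measure how far the plan moves."""
    baseline = policy(scene, original_cot)
    intervened = policy(scene, edited_cot)
    return average_displacement(baseline, intervened)


# Toy stand-ins for real models (assumptions, for illustration only).
def epiphenomenal_policy(scene: str, cot: str) -> Trajectory:
    # Ignores the CoT entirely: always drives straight at unit speed.
    return [(float(t), 0.0) for t in range(1, 5)]


def causal_policy(scene: str, cot: str) -> Trajectory:
    # Reads the CoT: stays in place if the reasoning says to stop.
    speed = 0.0 if "stop" in cot.lower() else 1.0
    return [(speed * t, 0.0) for t in range(1, 5)]


if __name__ == "__main__":
    args = ("urban crossing", "Road is clear, proceed.", "Pedestrian ahead, stop.")
    print(cot_intervention_effect(epiphenomenal_policy, *args))  # 0.0
    print(cot_intervention_effect(causal_policy, *args))         # 2.5
```

Under these assumptions, a displacement near zero after a meaning-changing edit suggests the CoT is epiphenomenal for that scene, while a large displacement suggests the CoT shapes the trajectory.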

Key insights

VLADriveBench evaluates the causal link between a VLA model's chain-of-thought and its driving actions, and shows that high observational alignment does not imply that the CoT causally influences behavior.

Principles

Method

VLADriveBench combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to assess CoT-action relationships in VLA models.
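The summary names the observational metrics but does not define them. The sketch below shows one plausible set-based reading of three of them, purely as an assumption: the `Sample` fields, the function names, and the formulas are hypothetical and may differ from the benchmark's own definitions. Contradiction is left out because it would need a judgment about statement meaning that a set comparison cannot express.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Sample:
    scene_objects: Set[str]   # annotated objects present in the scene (assumed available)
    cot_objects: Set[str]     # objects the CoT refers to (assumed already extracted)
    cot_intent: str           # action the CoT says the vehicle will take
    executed_action: str      # action label derived from the output trajectory


def mentioning_rate(s: Sample) -> float:
    """Share of scene objects that the CoT mentions (assumed definition)."""
    if not s.scene_objects:
        return 0.0
    return len(s.scene_objects & s.cot_objects) / len(s.scene_objects)


def hallucination_rate(s: Sample) -> float:
    """Share of CoT-mentioned objects that are absent from the scene (assumed definition)."""
    if not s.cot_objects:
        return 0.0
    return len(s.cot_objects - s.scene_objects) / len(s.cot_objects)


def action_aligned(s: Sample) -> bool:
    """True when the stated intent matches the executed action (assumed definition)."""
    return s.cot_intent == s.executed_action


if __name__ == "__main__":
    sample = Sample(
        scene_objects={"pedestrian", "traffic_light", "truck"},
        cot_objects={"pedestrian", "cyclist"},
        cot_intent="stop",
        executed_action="stop",
    )
    print(mentioning_rate(sample))     # one of three scene objects is mentioned
    print(hallucination_rate(sample))  # 0.5: "cyclist" is not in the scene
    print(action_aligned(sample))      # True
```

Scores like these are observational only: they describe how well the CoT text matches the scene and the action, and by themselves say nothing about whether the CoT caused the action, which is what the intervention protocol is meant to test.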

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.