Driving Video Retrieval for Complex Queries with Structured Grounding

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

STRIVE-D is a data-calibrated framework for retrieving driving videos of complex dynamic events in autonomous driving. Existing vision-language, keyword-based, and rule-based methods often fail to identify events such as cut-ins or hard braking, because of limitations in text descriptions or the brittleness of rules. STRIVE-D uses weakly labeled in-domain videos to estimate how reliable each rule is, adapts the rules to observed data, and fuses the calibrated rule scores with vision-language and keyword-based retrieval signals. This makes it easier to find the specific dynamic events needed for data curation and safety validation. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over current methods.
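The headline figure is a relative improvement, not an absolute one, and the two are easy to confuse. The short sketch below shows the arithmetic. The baseline and improved accuracies are hypothetical values chosen for illustration only; they are not results from the paper, and the function name is an assumption.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Relative gain of new_acc over baseline_acc, as a fraction of the baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc


# Hypothetical numbers, NOT taken from the paper: a baseline top-1 accuracy
# of 0.25 rising to 0.46 is a 0.21 absolute gain but an 84% relative gain.
baseline, improved = 0.25, 0.46
print(f"absolute gain: {improved - baseline:.2f}")
print(f"relative gain: {relative_improvement(baseline, improved):.0%}")
```

A large relative gain can therefore coexist with a modest absolute accuracy when the baseline is low, so it helps to check both numbers in the original paper.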

Key takeaway

For autonomous driving engineers and data curation specialists working on safety validation, STRIVE-D targets the retrieval of complex dynamic events. If your current vision-language or rule-based systems struggle to identify nuanced scenarios like cut-ins, consider integrating data-calibrated rule fusion. The reported gain is up to 84% relative improvement in top-1 accuracy, which supports more precise data selection and more robust testing of critical driving behaviors.

Key insights

STRIVE-D improves driving video retrieval for complex events by calibrating rule-based methods with weakly labeled data and fusing signals.

Principles

Rule-based retrieval needs data calibration for robustness.
Fusing calibrated rules with vision-language improves accuracy.
Weakly labeled in-domain videos enhance rule reliability.

Method

STRIVE-D estimates query rule reliability using weakly labeled in-domain videos, adapts rules to observed data, and fuses these calibrated rule scores with vision-language and keyword-based retrieval signals.
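The calibrate-then-fuse idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothed-precision reliability estimate, the weighted-sum fusion, the weights, and every name here (`Clip`, `estimate_rule_reliability`, `fuse`, `retrieve`) are hypothetical. The rule-adaptation step is omitted because the summary does not describe how it works.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    vl_score: float       # vision-language similarity, assumed to lie in [0, 1]
    keyword_score: float  # keyword match score, assumed to lie in [0, 1]
    rule_score: float     # rule output for the query, assumed to lie in [0, 1]


def estimate_rule_reliability(rule_fired, weak_labels, prior=0.5, strength=2.0):
    """Smoothed precision of a rule against weak labels (an assumed estimator).

    rule_fired and weak_labels are parallel lists of booleans over in-domain
    clips. The prior pulls the estimate toward 0.5 when the rule fires rarely.
    """
    labels_when_fired = [label for hit, label in zip(rule_fired, weak_labels) if hit]
    agreements = sum(labels_when_fired)
    return (agreements + prior * strength) / (len(labels_when_fired) + strength)


def fuse(clip, reliability, w_vl=1.0, w_kw=0.5):
    """Weighted sum in which the rule's contribution scales with its reliability."""
    return w_vl * clip.vl_score + w_kw * clip.keyword_score + reliability * clip.rule_score


def retrieve(clips, reliability, k=1):
    """Return the k clips with the highest fused score."""
    return sorted(clips, key=lambda c: fuse(c, reliability), reverse=True)[:k]


# Hypothetical weak labels: the rule fired on three clips and two were positive.
reliability = estimate_rule_reliability(
    rule_fired=[True, True, True, False],
    weak_labels=[True, True, False, False],
)
clips = [
    Clip("a", vl_score=0.6, keyword_score=0.0, rule_score=0.0),
    Clip("b", vl_score=0.4, keyword_score=0.0, rule_score=1.0),
]
print(retrieve(clips, reliability)[0].clip_id)
```

In this sketch a rule that agrees with the weak labels can lift a clip above one with a higher vision-language score, while an unreliable rule is largely ignored. That is the behavior the summary attributes to calibration, though the actual scoring function in STRIVE-D may differ.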

In practice

Use STRIVE-D for autonomous driving safety validation.
Apply data calibration to brittle rule-based systems.
Integrate diverse retrieval signals for complex queries.

Topics

Driving Video Retrieval
Autonomous Driving
Vision-Language Models
Rule-Based Systems
Data Calibration
DrivingDojo Benchmark
Complex Event Detection

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.