EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence
Summary
The Evidence-Grounded Video Question Answering Benchmark (EG-VQA) is a new open-ended evaluation protocol for VideoQA that addresses the gap between answer correctness and evidence grounding. It comprises 2,067 videos and 11,838 QA pairs, each explicitly annotated with supporting temporal evidence, so a model must both reason to the answer and precisely localize the evidence behind it. To assess predicted evidence, the authors propose the Evidence-Grounded F1 (EG-F1) metric, which measures temporal alignment and semantic consistency. Experiments show that even strong proprietary Video-LLMs struggle to ground their predictions in the correct temporal evidence. To narrow this gap, the authors present EG-Reasoner, an evidence-grounded reasoning model trained with explicit supervision. EG-Reasoner achieves state-of-the-art performance among open-source models and is competitive with proprietary systems, particularly on reasoning-intensive tasks such as counterfactual questions. The results suggest that robust and interpretable VideoQA requires structured evidence supervision, not model scaling alone.
Key takeaway
Machine learning engineers developing VideoQA systems should prioritize explicit evidence grounding, not just answer correctness. Integrate structured evidence supervision into model training to achieve more robust and interpretable results, especially on complex reasoning tasks such as counterfactual questions. Use the EG-VQA benchmark and the EG-F1 metric to evaluate how well your models localize supporting temporal evidence, so that their answers can be verified against the video.
Key insights
Video-LLMs need explicit evidence grounding and structured supervision, not just scaling, for robust and interpretable video question answering.
Principles
- VideoQA benchmarks need explicit evidence grounding.
- Answer correctness doesn't guarantee evidence localization.
- Structured evidence supervision enhances VideoQA robustness.
Method
EG-Reasoner is an evidence-grounded reasoning model trained with explicit supervision to bridge the gap between answer correctness and faithful evidence localization in VideoQA.
In practice
- Evaluate VideoQA models using the EG-VQA benchmark.
- Measure evidence grounding with the EG-F1 metric.
- Integrate structured evidence supervision during training.
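
To make the idea of scoring temporal evidence concrete, here is a minimal Python sketch of a segment-level F1 based on temporal IoU. It is not the paper's EG-F1: this summary does not give the exact definition, and the sketch covers only temporal alignment, not semantic consistency. The names `temporal_iou` and `segment_f1`, the 0.5 IoU threshold, and the greedy one-to-one matching are all assumptions made for illustration.

```python
from typing import List, Tuple

# A temporal evidence segment as (start_seconds, end_seconds).
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], gold: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Illustrative F1 over evidence segments (assumed, not the paper's EG-F1).

    Each predicted segment is greedily matched to the unmatched gold segment
    with the highest IoU; the match counts as a true positive only if that
    IoU reaches iou_threshold.
    """
    if not pred or not gold:
        return 0.0  # convention chosen for this sketch

    matched_gold = set()
    true_positives = 0
    for p in pred:
        best_idx, best_iou = -1, 0.0
        for i, g in enumerate(gold):
            if i in matched_gold:
                continue
            iou = temporal_iou(p, g)
            if iou > best_iou:
                best_idx, best_iou = i, iou
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched_gold.add(best_idx)
            true_positives += 1

    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [(0.0, 10.0), (20.0, 30.0)]
    annotated = [(0.0, 10.0), (50.0, 60.0)]
    print(segment_f1(predicted, annotated))
```

In this usage example, one of the two predicted segments overlaps an annotated segment exactly and the other overlaps nothing, so precision and recall are both one half. A score of this kind could be reported next to answer accuracy to show when a model answers correctly without pointing at the right part of the video.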
Topics
- Video Question Answering
- Video-LLMs
- Evidence Grounding
- EG-VQA Benchmark
- EG-F1 Metric
- Structured Supervision
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.