WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
Summary
WikiVQABench is a new human-curated benchmark for knowledge-grounded Visual Question Answering (VQA). It is designed to evaluate how well vision-language models (VLMs) answer questions that require external knowledge beyond the visual content. It was constructed by combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. The pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets, which human annotators then review for factual correctness, visual-text consistency, and the necessity of external knowledge. The benchmark comprises a collection of Wikipedia images with curated multiple-choice questions. An evaluation of fifteen VLMs, ranging from 256M to 90B parameters, showed a wide performance spectrum (24.7% to 75.6% accuracy), indicating that the benchmark discriminates between models on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.
Key takeaway
For Machine Learning Engineers developing or evaluating Vision-Language Models, traditional perception-focused VQA benchmarks are insufficient for real-world scenarios requiring external knowledge. You should integrate WikiVQABench into your evaluation pipeline to rigorously assess your model's ability to perform knowledge-intensive reasoning. This benchmark will help you identify specific gaps in your VLM's capacity to combine visual evidence with structured external information, guiding future model development towards more robust, knowledge-aware systems.
Key insights
WikiVQABench evaluates VLMs on knowledge-intensive VQA by requiring external knowledge beyond visual cues for correct answers.
Principles
- Real-world VQA needs external knowledge.
- Human curation ensures factual and visual consistency.
- LLMs can generate VQA candidates for human refinement.
Method
Combine Wikipedia images and their captions with structured facts from Wikidata. Use LLMs to generate candidate multiple-choice VQA sets. Human annotators then check each set for factual correctness, visual-text consistency, and whether answering requires external knowledge.
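The curation step described above can be sketched as a simple filter over candidate items. This is a minimal illustration, not the benchmark's actual code: the `Candidate` schema, its field names, and the `passes_curation` and `curate` helpers are all hypothetical, and the three boolean flags stand in for verdicts that human annotators would record.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One LLM-generated multiple-choice item (hypothetical schema)."""
    image_url: str
    caption: str
    question: str
    options: list[str]
    answer_index: int
    # Reviewer verdicts, filled in during human curation (assumed fields).
    factually_correct: bool = False
    consistent_with_image: bool = False
    needs_external_knowledge: bool = False


def passes_curation(item: Candidate) -> bool:
    """Keep an item only if it is well-formed and passed all three checks."""
    well_formed = 0 <= item.answer_index < len(item.options)
    return (
        well_formed
        and item.factually_correct
        and item.consistent_with_image
        and item.needs_external_knowledge
    )


def curate(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that survive human review, in original order."""
    return [c for c in candidates if passes_curation(c)]
```

Requiring all three verdicts mirrors the review criteria named in the summary; an item that is factually correct but answerable from the image alone would be dropped.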
In practice
- Benchmark VLMs on knowledge-intensive tasks.
- Identify VLM weaknesses in external knowledge integration.
- Develop models that fuse visual evidence with external knowledge.
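Benchmarking a model on multiple-choice items comes down to computing accuracy over predicted option indices. The sketch below assumes each item is a dict with `options` and `answer_index` keys and that the model is wrapped in a `predict` callable returning an option index; these names are illustrative and are not taken from the released benchmarking code. A chance-level baseline is included as a reference point when reading low scores.

```python
from typing import Callable, Sequence


def accuracy(items: Sequence[dict], predict: Callable[[dict], int]) -> float:
    """Fraction of items where the predicted option index matches the key."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item["answer_index"])
    return correct / len(items)


def chance_accuracy(items: Sequence[dict]) -> float:
    """Expected accuracy of uniform random guessing over each item's options."""
    if not items:
        raise ValueError("no items to score")
    return sum(1 / len(item["options"]) for item in items) / len(items)


if __name__ == "__main__":
    # Toy items with made-up content, used only to exercise the functions.
    toy_items = [
        {"options": ["A", "B", "C", "D"], "answer_index": 0},
        {"options": ["A", "B", "C", "D"], "answer_index": 2},
    ]
    always_first = lambda item: 0  # trivial stand-in for a real VLM
    print(accuracy(toy_items, always_first))
    print(chance_accuracy(toy_items))
```

Comparing a model's accuracy against the chance baseline helps separate genuine knowledge-grounded reasoning from guessing.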
Topics
- Visual Question Answering
- Knowledge-Grounded AI
- Vision-Language Models
- Benchmark Datasets
- Wikipedia
- Wikidata
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.