WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

2026-05-20 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

WikiVQABench is a human-curated benchmark for knowledge-grounded Visual Question Answering (VQA). It evaluates vision-language models (VLMs) on questions that cannot be answered from the image alone and instead require external knowledge. It was built by combining Wikipedia images, the captions from their associated articles, and structured knowledge from Wikidata. In the construction pipeline, large language models (LLMs) generate candidate multiple-choice image-question-answer sets, which human annotators then review for factual correctness, visual-text consistency, and the necessity of external knowledge. The result is a collection of Wikipedia images paired with curated multiple-choice questions. An evaluation of fifteen VLMs, ranging from 256M to 90B parameters, produced accuracies from 24.7% to 75.6%, a spread indicating that the benchmark discriminates between models on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

Key takeaway

If you develop or evaluate vision-language models, perception-focused VQA benchmarks alone will not tell you how a model handles real-world questions that require external knowledge. Adding WikiVQABench to your evaluation pipeline lets you measure knowledge-intensive reasoning directly and identify where a model fails to combine visual evidence with structured external information, which can guide development toward more robust, knowledge-aware systems.

Key insights

WikiVQABench evaluates VLMs on knowledge-intensive VQA by requiring external knowledge beyond visual cues for correct answers.

Principles

Real-world VQA needs external knowledge.
Human curation ensures factual and visual consistency.
LLMs can generate VQA candidates for human refinement.

Method

Combine Wikipedia images/captions with Wikidata. Use LLMs to generate multiple-choice VQA sets. Human annotators curate for factual correctness, visual-text consistency, and external knowledge requirement.
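
The three stages above can be sketched as a record type plus a review filter. This is a minimal illustration, not the authors' code: the Candidate and Review classes, their field names, and the rule that an item must pass all three checks to be kept are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One LLM-generated multiple-choice item (hypothetical schema)."""
    image_url: str             # Wikipedia image
    caption: str               # caption from the associated article
    wikidata_id: str           # entity supplying the structured knowledge
    question: str
    options: tuple[str, ...]   # answer choices shown to the model
    answer_index: int          # position of the correct option


@dataclass(frozen=True)
class Review:
    """A human annotator's verdict on the three criteria named in the text."""
    factually_correct: bool
    visual_text_consistent: bool
    needs_external_knowledge: bool

    def accepted(self) -> bool:
        # Assumption: an item is kept only if it passes every check.
        return (
            self.factually_correct
            and self.visual_text_consistent
            and self.needs_external_knowledge
        )


def curate(pairs):
    """Keep the candidates whose review passes all three checks."""
    return [candidate for candidate, review in pairs if review.accepted()]
```

A question that annotators judge answerable from the image alone would fail the third check and be dropped, which is what separates this kind of benchmark from perception-focused VQA sets.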

In practice

Benchmark VLMs on knowledge-intensive tasks.
Identify VLM weaknesses in external knowledge integration.
Develop models that fuse visual input with external knowledge.
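
Scoring a model on multiple-choice items reduces to comparing predicted option indices with the answer key. The sketch below is not the released benchmarking code; the function names and the idea of grouping items by a label (for example, a question category) are assumptions used to show how weak spots might be located.

```python
from collections import defaultdict


def accuracy(predictions, gold):
    """Fraction of items where the predicted option index matches the key."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def accuracy_by_group(predictions, gold, groups):
    """Per-group accuracy; `groups` holds one hypothetical label per item."""
    if not (len(predictions) == len(gold) == len(groups)):
        raise ValueError("all inputs must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, g, label in zip(predictions, gold, groups):
        totals[label] += 1
        hits[label] += int(p == g)
    return {label: hits[label] / totals[label] for label in totals}


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing over the options."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options
```

Comparing a model's accuracy against chance_accuracy for the benchmark's actual number of options shows how far above guessing it sits, and the per-group breakdown points to the kinds of questions where external knowledge integration fails.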

Topics

Visual Question Answering
Knowledge-Grounded AI
Vision-Language Models
Benchmark Datasets
Wikipedia
Wikidata

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.