VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

2026-06-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

VinQA is a new dataset for long-form answer generation in real-world multimodal document QA, where visual elements are explicitly interleaved with supporting text and grounded in the relevant document pages. It targets a current limitation of multimodal large language models (MLLMs): they predominantly produce text-only responses and underuse visual information. The research explores two ways of feeding raw document page images into MLLMs: Page Encoding, which passes full-page images with bounding boxes marking citable visual regions, and Modality Encoding, which parses each page and encodes the text and the cropped visual elements separately. For evaluation, the work proposes M-GroSE, a multimodal framework extending GroUSE that assesses completeness, relevancy, faithfulness, and unanswerability, and it reports Visual Source F1 for citation accuracy. Proprietary frontier models achieve the highest scores, but fine-tuning open Qwen2.5-VL models on VinQA substantially narrows the gap. Before fine-tuning, Modality Encoding was the more robust of the two methods; after training, Page Encoding reached comparable levels.

Key takeaway

For Machine Learning Engineers developing multimodal document QA systems, VinQA demonstrates a path to richer, visually interleaved answers. Consider fine-tuning open MLLMs such as Qwen2.5-VL on specialized datasets to improve their ability to cite and integrate visual elements. This lets your systems move beyond text-only responses and improves faithfulness and completeness, which you can track with metrics such as M-GroSE and Visual Source F1.

Key insights

MLLMs can generate long-form answers with interleaved, cited visual elements using new datasets and encoding methods.

Principles

Multimodal document QA benefits from explicit visual element citation.
Training on specialized datasets improves MLLM multimodal performance.
Page Encoding can match Modality Encoding with sufficient training.

Method

Two encoding methods: Page Encoding (full-page images with bounding boxes) and Modality Encoding (separate encoding of parsed text and cropped visual elements). Evaluation uses M-GroSE and Visual Source F1.
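To make the difference between the two encodings concrete, here is a minimal Python sketch of how the model input might be assembled in each case. The segment layout, the `VisualRegion` type, the region ID naming, and the bounding-box convention are all hypothetical illustrations, not the format used by VinQA or any specific MLLM API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VisualRegion:
    """A citable visual element on a page (hypothetical structure)."""
    region_id: str                    # e.g. "p1_fig1"; naming is an assumption
    page: int                         # 1-based page number
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels; assumed convention


def page_encoding(page_images: List[str], regions: List[VisualRegion]) -> List[Dict]:
    """Page Encoding: full-page images, each followed by the bounding boxes
    of the citable regions found on that page."""
    segments: List[Dict] = []
    for page_no, image_path in enumerate(page_images, start=1):
        segments.append({"type": "image", "path": image_path})
        for region in regions:
            if region.page == page_no:
                segments.append(
                    {"type": "text", "text": f"[{region.region_id}] bbox={region.bbox}"}
                )
    return segments


def modality_encoding(text_blocks: List[str], crops: Dict[str, str]) -> List[Dict]:
    """Modality Encoding: parsed text first, then each cropped visual element
    as its own image, labeled with its citable ID."""
    segments: List[Dict] = [{"type": "text", "text": block} for block in text_blocks]
    for region_id, crop_path in crops.items():
        segments.append({"type": "text", "text": f"[{region_id}]"})
        segments.append({"type": "image", "path": crop_path})
    return segments
```

In both sketches the model sees the same citable IDs; what changes is whether the visual element arrives inside a full-page image with coordinates or as a standalone crop next to parsed text.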

In practice

Fine-tune open MLLMs like Qwen2.5-VL on VinQA for multimodal QA.
Consider Page Encoding when you have enough training data; after fine-tuning it reached levels comparable to Modality Encoding.
Use M-GroSE to evaluate multimodal answer generation quality.
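Alongside M-GroSE, citation accuracy is reported as Visual Source F1. The summary does not give the exact formula, so the sketch below assumes the standard set-based F1 between the visual element IDs an answer cites and the gold IDs. The function name, the ID format, and the handling of empty sets are assumptions.

```python
from typing import Iterable


def visual_source_f1(cited: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 over cited vs. gold visual element IDs (assumed definition)."""
    cited_set, gold_set = set(cited), set(gold)
    if not cited_set and not gold_set:
        # Assumption: no gold visuals and none cited counts as a perfect score.
        return 1.0
    true_positives = len(cited_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cited_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# One of two cited visuals is correct, one of two gold visuals is found.
print(visual_source_f1(["p1_fig1", "p2_tab1"], ["p1_fig1", "p3_fig2"]))  # 0.5
```

A per-answer score like this would typically be averaged over the evaluation set; whether VinQA averages per answer or pools counts across answers is not stated here.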

Topics

Multimodal LLMs
Document QA
VinQA Dataset
Visual Element Citation
Page Encoding
M-GroSE Evaluation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.