VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
Summary
VinQA is a dataset for long-form answer generation in real-world multimodal document QA, in which visual elements are explicitly interleaved with supporting text and grounded in the relevant document pages. It targets a limitation of current multimodal large language models (MLLMs), which predominantly produce text-only responses and so underuse visual information. The work explores two methods for feeding raw document page images into MLLMs: Page Encoding, which uses full-page images with bounding boxes marking citable visual regions, and Modality Encoding, which parses each page and encodes the text and cropped visual elements separately. For evaluation, the authors propose M-GroSE, a multimodal framework extending GroUSE that assesses completeness, relevancy, faithfulness, and unanswerability, and they report Visual Source F1 for citation accuracy. Proprietary frontier models achieve the highest scores, but fine-tuning open Qwen2.5-VL models on VinQA substantially narrows the gap. Before training, Modality Encoding was the more robust of the two; after training, Page Encoding reached comparable performance.
Key takeaway
For machine learning engineers building multimodal document QA systems, VinQA shows a path to richer answers that interleave visual elements with text. Consider fine-tuning open MLLMs such as Qwen2.5-VL on specialized datasets to improve their ability to cite and integrate visual elements. This lets your systems move beyond text-only responses, and metrics such as M-GroSE and Visual Source F1 let you measure the resulting faithfulness, completeness, and citation accuracy.
Key insights
MLLMs can generate long-form answers with interleaved, cited visual elements using new datasets and encoding methods.
Principles
- Multimodal document QA benefits from explicit visual element citation.
- Training on specialized datasets improves MLLM multimodal performance.
- Page Encoding can match Modality Encoding with sufficient training.
Method
Two encoding methods: Page Encoding (full-page images with bounding boxes) and Modality Encoding (separate encoding of parsed text and cropped visual elements). Evaluation uses M-GroSE and Visual Source F1.
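The difference between the two encodings can be sketched as a data-preparation step. This is a minimal illustration, not the paper's pipeline: the `Page` and `VisualRegion` classes, the dictionary keys, and the region IDs are all hypothetical names introduced here, and a real system would load and crop actual images rather than pass paths and boxes around.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class VisualRegion:
    region_id: str  # hypothetical citation handle, e.g. "p1_fig1"
    bbox: BBox


@dataclass
class Page:
    page_id: str
    image_path: str  # path to the raw page image
    text: str        # text parsed from the page (used by Modality Encoding)
    regions: List[VisualRegion] = field(default_factory=list)


def page_encoding(pages: List[Page]) -> List[Dict]:
    """Page Encoding: one full-page image per page, with boxes marking citable regions."""
    return [
        {
            "type": "page_image",
            "page": p.page_id,
            "image": p.image_path,
            "citable_boxes": {r.region_id: r.bbox for r in p.regions},
        }
        for p in pages
    ]


def modality_encoding(pages: List[Page]) -> List[Dict]:
    """Modality Encoding: parsed text and each cropped visual element as separate inputs."""
    items: List[Dict] = []
    for p in pages:
        items.append({"type": "text", "page": p.page_id, "text": p.text})
        for r in p.regions:
            items.append(
                {
                    "type": "image_crop",
                    "id": r.region_id,
                    "source": p.image_path,
                    "crop_box": r.bbox,
                }
            )
    return items
```

Under this sketch, a page with two figures yields one input item under Page Encoding and three under Modality Encoding (one text item plus two crops), which shows why the second depends on a document parser while the first leaves region localization to the model.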
In practice
- Fine-tune open MLLMs like Qwen2.5-VL on VinQA for multimodal QA.
- Consider Page Encoding when you have enough data to fine-tune; after training it performed comparably to Modality Encoding.
- Use M-GroSE to evaluate multimodal answer generation quality.
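Alongside M-GroSE, the summary names Visual Source F1 as the citation-accuracy metric. Its exact definition is not given here, so the following is only a sketch under the assumption that it is a set-based F1 between the visual elements an answer cites and the gold visual sources. The function name, the ID format, and the convention of scoring 1.0 when both sets are empty are assumptions made for illustration.

```python
from typing import Iterable


def visual_source_f1(cited: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between cited and gold visual element IDs (assumed definition)."""
    cited_set, gold_set = set(cited), set(gold)

    # Assumed convention: nothing to cite and nothing cited counts as perfect.
    if not cited_set and not gold_set:
        return 1.0

    true_positives = len(cited_set & gold_set)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(cited_set)  # share of citations that are correct
    recall = true_positives / len(gold_set)      # share of gold sources that were cited
    return 2 * precision * recall / (precision + recall)
```

For example, an answer citing `["p1_fig1", "p2_tab1"]` against gold sources `["p2_tab1", "p3_fig2"]` has precision 0.5 and recall 0.5, so the score is 0.5.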
Topics
- Multimodal LLMs
- Document QA
- VinQA Dataset
- Visual Element Citation
- Page Encoding
- M-GroSE Evaluation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.