HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering
Summary
HiKEY is a hierarchical, tree-based multimodal retrieval framework for retrieval-augmented generation (RAG) in open-domain document question answering (ODQA) over large industrial corpora. It targets two bottlenecks: routing failure, where the retriever does not locate the correct documents, and evidence fragmentation, where the relevant information is scattered across elements such as tables and figures. Instead of simple chunking, HiKEY uses Document Hierarchical Parsing (DHP) to reconstruct each document as a logical heterogeneous graph that explicitly encodes parent-child relationships. Retrieval is coarse-to-fine: global routing over a hierarchical index first prunes the search space, then fine-grained retrieval ranks sections via multimodal fusion. Finally, a hybrid structural-semantic packing strategy assembles a token-efficient evidence subgraph. Experiments show HiKEY improves retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8% over page- and chunk-based baselines.
Key takeaway
For NLP engineers building RAG systems for complex document question answering, HiKEY suggests that hierarchical document parsing and multimodal fusion are worth adopting to counter routing failures and evidence fragmentation. The reported gains (up to 12.9% in retrieval recall and up to 6.8% in end-to-end QA performance) were obtained on large industrial corpora containing diverse content types such as tables and figures.
Key insights
HiKEY uses document hierarchy and multimodal fusion for efficient, accurate RAG in ODQA.
Principles
- Document hierarchy improves retrieval precision.
- Multimodal fusion enhances evidence integration.
- Coarse-to-fine retrieval prunes search space.
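The three principles above can be illustrated together in a minimal sketch. This is not HiKEY's implementation: the `Document` and `Section` classes, the precomputed similarity scores, and the weighted-sum fusion are all assumptions chosen for illustration, since the summary does not specify how HiKEY scores or fuses modalities.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    # Hypothetical per-modality similarity to the query, each in [0, 1].
    scores: dict = field(default_factory=dict)


@dataclass
class Document:
    name: str
    doc_score: float  # hypothetical document-level relevance used for routing
    sections: list = field(default_factory=list)


def fuse(scores, weights):
    """Weighted sum over modalities; a missing modality contributes 0."""
    return sum(w * scores.get(m, 0.0) for m, w in weights.items())


def coarse_to_fine(corpus, weights, top_docs=1, top_sections=2):
    # Coarse stage: keep only the best-scoring documents (prunes the search space).
    routed = sorted(corpus, key=lambda d: d.doc_score, reverse=True)[:top_docs]
    # Fine stage: rank sections of the surviving documents by fused score.
    candidates = [
        (fuse(s.scores, weights), d.name, s.title)
        for d in routed
        for s in d.sections
    ]
    candidates.sort(reverse=True)
    return candidates[:top_sections]


corpus = [
    Document("manual_a", 0.9, [
        Section("Specs", {"text": 0.4, "table": 0.9}),
        Section("Intro", {"text": 0.6}),
        Section("Wiring", {"text": 0.2, "figure": 0.5}),
    ]),
    Document("manual_b", 0.3, [Section("Overview", {"text": 0.95})]),
]
weights = {"text": 0.5, "table": 0.3, "figure": 0.2}  # assumed fusion weights
results = coarse_to_fine(corpus, weights)
for score, doc, title in results:
    print(f"{doc} / {title}: {score:.2f}")
```

Note that the "Overview" section of `manual_b` has the highest text score in the corpus but is never considered, because its document is pruned at the coarse stage. That is the trade-off of routing first: a smaller search space at the cost of depending on document-level relevance being right.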
Method
HiKEY reconstructs a logical heterogeneous graph via DHP, then uses hierarchical indexing for global routing, followed by multimodal fusion for fine-grained section ranking, and finally structural-semantic packing for evidence subgraph assembly.
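The final packing step can be sketched as a greedy procedure under a token budget. The summary does not describe HiKEY's actual packing algorithm, so everything here is an assumption: the `nodes` dictionary layout, the rule that a selected node brings its not-yet-selected ancestors along (the structural part), and the use of a precomputed relevance ranking (the semantic part).

```python
def pack_evidence(nodes, ranked_ids, budget):
    """Greedily pack ranked nodes plus their ancestors under a token budget.

    nodes: id -> {"parent": id or None, "tokens": int}  (hypothetical layout)
    ranked_ids: node ids ordered from most to least relevant
    """
    selected = []
    chosen = set()
    used = 0
    for node_id in ranked_ids:
        # Structural part: walk up to collect ancestors not yet selected,
        # so each piece of evidence keeps its heading context.
        chain = []
        cur = node_id
        while cur is not None and cur not in chosen:
            chain.append(cur)
            cur = nodes[cur]["parent"]
        cost = sum(nodes[n]["tokens"] for n in chain)
        # Skip candidates that would exceed the budget; cheaper ones may still fit.
        if used + cost <= budget:
            for n in reversed(chain):  # emit ancestors before descendants
                selected.append(n)
                chosen.add(n)
            used += cost
    return selected, used


nodes = {
    "doc": {"parent": None, "tokens": 5},
    "sec1": {"parent": "doc", "tokens": 10},
    "tab1": {"parent": "sec1", "tokens": 40},
    "sec2": {"parent": "doc", "tokens": 10},
    "fig2": {"parent": "sec2", "tokens": 60},
}
selected, used = pack_evidence(nodes, ["tab1", "fig2", "sec2"], budget=80)
print(selected, used)
```

Because ancestors already in the subgraph are not charged twice, sibling evidence under a shared heading becomes cheaper to add, which is one plausible way a tree structure can make packing token-efficient.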
In practice
- Apply DHP for complex document structures.
- Integrate multimodal data for richer context.
- Use hierarchical indexing for large corpora.
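As a starting point for the first item, parent-child relationships can be recovered from an ordered list of headings with a stack. This is only a stand-in for DHP, whose details the summary does not give; the `(level, title)` input format and the dictionary node layout are assumptions, and a real parser would also need to attach tables, figures, and body text to their sections.

```python
def build_tree(headings):
    """Build a nested tree from (level, title) pairs given in reading order.

    Level 1 is the top heading level; deeper headings have larger levels.
    """
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]  # path from the root to the most recent node
    for level, title in headings:
        node = {"title": title, "level": level, "children": []}
        # Pop until the top of the stack is a strictly shallower heading,
        # which is the parent of the new node.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


tree = build_tree([
    (1, "Installation"),
    (2, "Requirements"),
    (2, "Steps"),
    (1, "Usage"),
])
for child in tree["children"]:
    print(child["title"], [c["title"] for c in child["children"]])
```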
Topics
- Retrieval-Augmented Generation
- Multimodal Retrieval
- Document Question Answering
- Hierarchical Indexing
- Document Parsing
- Open-Domain QA
Best for: Research Scientist, AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.