HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

HiKEY is a hierarchical, tree-based multimodal retrieval framework that targets two bottlenecks in retrieval-augmented generation (RAG) for open-domain document question answering (ODQA) over large industrial corpora: routing failure, where the correct documents are not located, and evidence fragmentation, where relevant information is scattered across elements such as tables and figures. Instead of simple chunking, HiKEY uses Document Hierarchical Parsing (DHP) to reconstruct a logical heterogeneous graph that explicitly encodes parent-child relationships. Retrieval then proceeds coarse to fine: global routing with hierarchical indexing first prunes the search space, and fine-grained retrieval then ranks sections via multimodal fusion. Finally, a hybrid structural-semantic packing strategy assembles a token-efficient evidence subgraph. Experiments show HiKEY improves retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8% over page- and chunk-based baselines.
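To make the idea of explicitly encoded parent-child relationships concrete, here is a minimal Python sketch of a parsed document tree. It is an illustration only, not the DHP implementation: the `Node` class, its fields, the `kind` labels, and the example document are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """One node of a parsed document tree (hypothetical schema, not from the paper)."""
    node_id: str
    kind: str   # e.g. "document", "section", "table", "figure"
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child and record the parent link in both directions."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestors(self) -> Iterator["Node"]:
        """Walk upward from this node's parent to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def heading_path(self) -> str:
        """Root-to-parent path of document and section titles."""
        titles = [n.text for n in self.ancestors() if n.kind in ("document", "section")]
        return " > ".join(reversed(titles))


# A table stays linked to the section and document that contain it.
doc = Node("d1", "document", "Pump Manual")
sec = doc.add(Node("d1.s2", "section", "Maintenance"))
tbl = sec.add(Node("d1.s2.t1", "table", "Torque values per bolt size"))
print(tbl.heading_path())  # Pump Manual > Maintenance
```

Keeping the upward link means a retrieved table or figure can always be presented together with the headings that give it context, which flat chunking discards.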

Key takeaway

For NLP engineers building retrieval-augmented generation systems for complex document question answering, HiKEY is worth a close look. Consider adopting hierarchical document parsing and multimodal fusion to counter routing failures and evidence fragmentation. The approach can improve retrieval recall and end-to-end QA performance, especially on large industrial corpora containing diverse data types such as tables and figures.

Key insights

HiKEY uses document hierarchy and multimodal fusion for efficient, accurate RAG in ODQA.

Method

HiKEY reconstructs a logical heterogeneous graph via DHP, then uses hierarchical indexing for global routing, followed by multimodal fusion for fine-grained section ranking, and finally structural-semantic packing for evidence subgraph assembly.
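The coarse-to-fine flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not HiKEY's method: word overlap stands in for learned embeddings, a fixed weight `alpha` stands in for multimodal fusion, and a greedy token budget stands in for structural-semantic packing. All names (`overlap`, `route`, `rank_sections`, `pack`, `Section`) and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def overlap(query: str, text: str) -> float:
    """Toy relevance: fraction of query words found in text (stand-in for embeddings)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


@dataclass
class Section:
    doc_id: str
    title: str
    body: str
    captions: str = ""  # text attached to tables and figures in this section


def route(query: str, doc_summaries: Dict[str, str], top_k: int) -> List[str]:
    """Coarse step: keep only the top_k documents, pruning the search space."""
    ranked = sorted(doc_summaries, key=lambda d: overlap(query, doc_summaries[d]), reverse=True)
    return ranked[:top_k]


def rank_sections(query: str, sections: List[Section], doc_ids: List[str],
                  alpha: float = 0.7) -> List[Tuple[float, Section]]:
    """Fine step: weighted fusion of a text score and a table/figure caption score."""
    allowed = set(doc_ids)
    scored = [
        (alpha * overlap(query, s.title + " " + s.body)
         + (1 - alpha) * overlap(query, s.captions), s)
        for s in sections if s.doc_id in allowed
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def pack(scored: List[Tuple[float, Section]], budget_tokens: int) -> List[Section]:
    """Greedily keep the best-scoring sections that fit a whitespace-token budget."""
    chosen, used = [], 0
    for _, s in scored:
        cost = len((s.title + " " + s.body + " " + s.captions).split())
        if used + cost <= budget_tokens:
            chosen.append(s)
            used += cost
    return chosen


doc_summaries = {"pump": "pump maintenance manual torque bolts", "hr": "employee leave policy"}
sections = [
    Section("pump", "Maintenance", "tighten bolts in sequence", "table torque values per bolt size"),
    Section("pump", "Safety", "wear gloves"),
    Section("hr", "Leave", "annual leave torque"),
]
query = "bolt torque values"

docs = route(query, doc_summaries, top_k=1)
scored = rank_sections(query, sections, docs)
evidence = pack(scored, budget_tokens=12)
print(docs, [s.title for s in evidence])  # ['pump'] ['Maintenance']
```

In this toy run the query matches only the table caption, so the caption score is what lifts the "Maintenance" section to the top. Routing first also means the unrelated "hr" section is never scored, even though it contains a query word.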

Best for: Research Scientist, AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.