Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
Summary
The DICE (Document Inference via Chunk Evidence) strategy addresses dense retrieval failures on long documents caused by "document-side early compression," where crucial information is weakened when a whole document is encoded into one vector. This training-free, document-side approach splits each document into chunks, encodes the chunks independently with a frozen model, and aggregates the chunk encodings into a single vector, so the standard one-query-one-document interface is preserved. Evaluated on LongEmbed, DICE improves retrieval across four backbones, with notable gains for documents exceeding 4k tokens. For instance, with Dream, the Passkey score on documents over 4k tokens rose from 30.0 to 90.0, and the Needle score on documents over 4k tokens rose from 23.3 to 74.0. Across 12,779 filtered samples, DICE achieved a lower Evidence Dilution Index (EDI) than the single-vector baseline in 92.8% of cases. These results point to document-level encoding as a practical, underexplored lever.
Key takeaway
For Machine Learning Engineers or AI Scientists building long-document retrieval systems: if retrieval quality degrades on lengthy texts, consider the DICE strategy. It is training-free, keeps the one-vector-per-document interface, and its reported gains are most notable for documents over 4k tokens, where it mitigates "evidence dilution." Because the encoder stays frozen, it can be added to an existing system without retraining the model.
Key insights
Improving long-document retrieval by aggregating independently encoded chunk-level evidence.
Principles
- Dense retrieval can fail on long documents due to early compression
- Chunk-level evidence aggregation overcomes evidence dilution
- Document-level encoding is an underexplored lever
Method
Split documents into chunks, encode independently with a frozen model, then aggregate into a single vector.
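The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the summary does not say how chunks are sized or how chunk vectors are aggregated, so the fixed-size chunking, the mean-then-normalize aggregation, and the hash-based `toy_encode` (a hypothetical stand-in for a frozen embedding model) are all placeholders.

```python
import hashlib
import math
from typing import Callable, List

Vector = List[float]


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def toy_encode(tokens: List[str], dim: int = 8) -> Vector:
    """Hypothetical stand-in for a frozen encoder: a hashed bag of words."""
    vec = [0.0] * dim
    for tok in tokens:
        bucket = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return l2_normalize(vec)


def encode_document(
    tokens: List[str],
    encode: Callable[[List[str]], Vector],
    chunk_size: int = 4,
) -> Vector:
    """Chunk, encode each chunk independently, aggregate into one vector.

    Aggregation here is a plain mean followed by L2 normalization; this is
    an assumption for illustration, not the aggregation rule from the paper.
    """
    chunks = split_into_chunks(tokens, chunk_size)
    if not chunks:
        raise ValueError("document is empty")
    chunk_vecs = [encode(chunk) for chunk in chunks]
    dim = len(chunk_vecs[0])
    mean = [sum(v[d] for v in chunk_vecs) / len(chunk_vecs) for d in range(dim)]
    return l2_normalize(mean)


doc_tokens = "the passkey is hidden somewhere in a very long document".split()
doc_vec = encode_document(doc_tokens, toy_encode, chunk_size=4)
```

Because the output is still one vector per document, it can be stored and searched exactly like a conventional single-vector embedding; only the document-side encoding step changes.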
In practice
- Apply DICE for long-document retrieval tasks
- Prioritize documents over 4k tokens, where the reported gains are most notable
- Encode chunks with a frozen model, so no retraining is needed
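One way to act on the points above is a simple length gate that sends only long documents through the chunked path. The sketch below is an assumption about deployment, not something the summary prescribes: the function name, the two encoder callables, and the 4096-token threshold (one reading of "4k") are hypothetical, and the summary does not say whether shorter documents would also benefit from chunking.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_with_length_gate(
    tokens: Sequence[str],
    encode_whole: Callable[[Sequence[str]], Vector],
    encode_chunked: Callable[[Sequence[str]], Vector],
    threshold: int = 4096,  # assumed cutoff; tune for your own corpus
) -> Vector:
    """Use the chunked encoder only when the document exceeds `threshold` tokens."""
    if len(tokens) > threshold:
        return encode_chunked(tokens)
    return encode_whole(tokens)
```

Both paths return a single vector, so the index and query side stay unchanged regardless of which branch a document takes.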
Topics
- Dense Retrieval
- Long Document Retrieval
- Chunk Evidence Aggregation
- Document Encoding
- DICE Strategy
- Information Retrieval
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.