Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

The DICE (Document Inference via Chunk Evidence) strategy addresses dense retrieval failures on long documents caused by "document-side early compression," where crucial information is weakened during encoding. This training-free, document-side approach splits each document into chunks, encodes the chunks independently with a frozen model, and then aggregates the chunk vectors into a single vector, so the standard one-query-one-document interface is preserved. Evaluated on LongEmbed, DICE improves retrieval across four backbones, with the largest gains on documents exceeding 4k tokens. For instance, Dream's Passkey >4k score rose from 30.0 to 90.0, and its Needle >4k score from 23.3 to 74.0. Across 12,779 filtered samples, DICE achieved a lower Evidence Dilution Index (EDI) than the single-vector baseline in 92.8% of cases, which points to document-level encoding as a practical, underexplored lever.

Key takeaway

For machine learning engineers and AI scientists building long-document retrieval systems: if retrieval quality degrades on lengthy texts, consider the DICE strategy. It is training-free and, on LongEmbed, delivered its largest improvements on documents over 4k tokens by mitigating "evidence dilution." Because it relies on a frozen encoder and keeps the one-query-one-document interface, it can be added to an existing system without retraining the model.

Key insights

Improving long-document retrieval by aggregating independently encoded chunk-level evidence.

Method

Split each document into chunks, encode every chunk independently with a frozen model, then aggregate the chunk vectors into a single document vector.
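
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the summary does not say how DICE chunks or aggregates, so the fixed-size word chunks and the mean pooling used here are stand-ins, and `toy_encode` is a hypothetical hashed bag-of-words function that takes the place of a real frozen embedding model.

```python
import hashlib
import math
from typing import Callable, List

Vector = List[float]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def toy_encode(text: str, dim: int = 16) -> Vector:
    """Hypothetical stand-in for a frozen encoder: hashed bag of words."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return l2_normalize(vec)


def split_into_chunks(text: str, chunk_size: int = 8) -> List[str]:
    """Split on whitespace into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def encode_document(text: str,
                    encode: Callable[[str], Vector] = toy_encode,
                    chunk_size: int = 8) -> Vector:
    """Encode each chunk independently, then pool into one document vector.

    Mean pooling is an assumption made for this sketch; the aggregation
    rule actually used by DICE is not described in the summary.
    """
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        raise ValueError("document contains no text")
    chunk_vecs = [encode(chunk) for chunk in chunks]
    dim = len(chunk_vecs[0])
    mean = [sum(v[i] for v in chunk_vecs) / len(chunk_vecs) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    document = "filler text " * 40 + "the passkey is 71432 " + "more filler " * 40
    query_vec = toy_encode("what is the passkey")
    doc_vec = encode_document(document)  # still one vector per document
    print(round(cosine(query_vec, doc_vec), 4))
```

Because `encode_document` returns a single vector, the scoring step is the same one-query-one-document comparison a single-vector retriever already performs; only the way the document vector is built changes.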

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.