Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

DICE (Document Inference via Chunk Evidence) is a novel, training-free strategy designed to improve dense retrieval performance on long documents by addressing "document-side early compression." This failure mode occurs when crucial, localized evidence within a long document is diluted during its encoding into a single vector. DICE mitigates this by splitting documents into chunks, encoding each independently with a frozen model and local position indices, then aggregating these chunk embeddings into a single document vector. This approach maintains the standard one-query-one-document retrieval interface. Evaluated on LongEmbed across Dream, Mistral, Llama3, and Qwen backbones, DICE significantly boosts retrieval, particularly for contexts beyond 4k tokens. For Dream, Passkey >4k scores increased from 30.0 to 90.0, and Needle >4k from 23.3 to 74.0. DICE also reduced the Evidence Dilution Index (EDI) in 92.8% of 12,779 samples.

Key takeaway

Machine Learning Engineers optimizing long-document retrieval systems should consider DICE, which improves performance without retraining models. If a system suffers from "document-side early compression" on documents exceeding 4k tokens, DICE's chunking and aggregation strategy can substantially improve recall. Expect a 3-4x increase in document-side encoding cost; this trade-off is often acceptable for offline indexing, where each document is encoded once and queried many times.

Key insights

Document-side chunk aggregation with local position encoding prevents evidence dilution in long-document dense retrieval.

Principles

Method

DICE splits documents into token chunks, encodes each independently with local position indices using a frozen model, then aggregates chunk embeddings into a single document vector via query-independent pooling (e.g., mean pooling).
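The method described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `toy_encode_chunk` is a hypothetical stand-in for a frozen embedding model, the chunk size is arbitrary, and the L2 normalization of the pooled vector is an assumption made here for cosine-style retrieval. What the sketch tries to show is the part the summary states: each chunk gets position indices that restart at 0, and the chunk embeddings are mean-pooled into one document vector.

```python
import numpy as np

# Hypothetical frozen "model": fixed random token and position tables.
# A real system would call a frozen embedding backbone here instead.
_rng = np.random.default_rng(0)
_TOKEN_TABLE = _rng.normal(size=(1000, 8))
_POS_TABLE = _rng.normal(size=(512, 8))


def toy_encode_chunk(token_ids, position_ids):
    """Stand-in encoder: returns one unit-length vector for a chunk."""
    states = _TOKEN_TABLE[np.asarray(token_ids)] + _POS_TABLE[np.asarray(position_ids)]
    vec = states.mean(axis=0)
    return vec / np.linalg.norm(vec)


def split_into_chunks(token_ids, chunk_size):
    """Split a token sequence into consecutive, non-overlapping chunks."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def dice_embed(token_ids, encode_chunk, chunk_size):
    """Encode chunks independently with local positions, then mean-pool."""
    if len(token_ids) == 0:
        raise ValueError("document must contain at least one token")
    chunk_vectors = []
    for chunk in split_into_chunks(token_ids, chunk_size):
        # Local position indices: every chunk starts again at position 0.
        local_positions = np.arange(len(chunk))
        chunk_vectors.append(encode_chunk(chunk, local_positions))
    # Query-independent aggregation: plain mean over chunk embeddings.
    doc_vector = np.mean(chunk_vectors, axis=0)
    # Assumption: normalize so the result can be used with cosine similarity.
    norm = np.linalg.norm(doc_vector)
    return doc_vector / norm if norm > 0 else doc_vector


doc_tokens = list(range(10))  # a 10-token "document"
doc_vector = dice_embed(doc_tokens, toy_encode_chunk, chunk_size=4)
```

Because the output is still a single vector per document, an index built this way can be queried through the usual one-query-one-document interface; only the document-side encoding step changes.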

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.