Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
Summary
DICE (Document Inference via Chunk Evidence) is a training-free strategy for improving dense retrieval on long documents by addressing "document-side early compression." This failure mode occurs when crucial, localized evidence within a long document is diluted as the document is encoded into a single vector. DICE mitigates it by splitting each document into chunks, encoding each chunk independently with a frozen model and local position indices, and then aggregating the chunk embeddings into a single document vector. The standard one-query-one-document retrieval interface is preserved. Evaluated on LongEmbed across Dream, Mistral, Llama3, and Qwen backbones, DICE improves retrieval, particularly for contexts beyond 4k tokens. For Dream, the Passkey >4k score increased from 30.0 to 90.0, and the Needle >4k score from 23.3 to 74.0. DICE also reduced the Evidence Dilution Index (EDI) in 92.8% of 12,779 samples.
Key takeaway
Machine learning engineers optimizing long-document retrieval systems should consider DICE as a way to improve performance without retraining models. If your system suffers from "document-side early compression" on documents exceeding 4k tokens, DICE's chunking and aggregation strategy can dramatically improve recall. Expect a 3-4x increase in document-side encoding cost; this trade-off is often acceptable for offline indexing, where documents are queried frequently.
Key insights
Document-side chunk aggregation with local position encoding prevents evidence dilution in long-document dense retrieval.
Principles
- Localized evidence is diluted by single-vector compression.
- Delaying compression improves long-document retrieval.
- Chunk granularity is a decisive design factor.
Method
DICE splits documents into token chunks, encodes each independently with local position indices using a frozen model, then aggregates chunk embeddings into a single document vector via query-independent pooling (e.g., mean pooling).
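The steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `toy_encode` is a hypothetical stand-in for a frozen embedding model, and the names `dice_document_vector`, `mean_pool`, and `DIM` are assumptions introduced here. Only the overall flow (chunk, encode with local positions, mean-pool) comes from the description.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
DIM = 8  # toy embedding width (assumption, chosen only for illustration)


def toy_encode(token_ids: Sequence[int], position_ids: Sequence[int]) -> Vector:
    """Hypothetical stand-in for a frozen embedder: returns one vector per chunk."""
    vec = [0.0] * DIM
    for tok, pos in zip(token_ids, position_ids):
        for k in range(DIM):
            # Deterministic pseudo-embedding that depends on token and position.
            vec[k] += math.sin((tok + 1) * (k + 1)) + math.cos((pos + 1) * (k + 1))
    return [v / len(token_ids) for v in vec]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Query-independent aggregation: element-wise mean of the chunk vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dice_document_vector(
    token_ids: Sequence[int],
    chunk_size: int = 1024,
    encode: Callable[[Sequence[int], Sequence[int]], Vector] = toy_encode,
) -> Vector:
    """Split into token chunks, encode each independently, then mean-pool."""
    if not token_ids:
        raise ValueError("document must contain at least one token")
    chunk_vectors = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        local_positions = list(range(len(chunk)))  # positions restart at 0 per chunk
        chunk_vectors.append(encode(chunk, local_positions))
    return mean_pool(chunk_vectors)


if __name__ == "__main__":
    document = [i % 500 for i in range(5000)]  # 5000 toy token ids -> 5 chunks
    print(len(dice_document_vector(document)))  # 8
```

Because the result is a single vector per document, it can be stored in an existing vector index and scored against query embeddings in the usual way.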
In practice
- Use a chunk size of 1024 tokens for the best average performance.
- Prefer mean pooling for robust aggregation across tasks.
- Reset position indices locally within each chunk.
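To make the position-reset recommendation concrete, the sketch below builds per-chunk position indices for a document of a given length. The helper name `chunk_position_ids` is hypothetical; how the indices are passed to a real model depends on that model's API and is not shown here.

```python
from typing import List


def chunk_position_ids(num_tokens: int, chunk_size: int = 1024) -> List[List[int]]:
    """Return one list of position indices per chunk, each restarting at 0."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(range(min(chunk_size, num_tokens - start)))
        for start in range(0, num_tokens, chunk_size)
    ]


if __name__ == "__main__":
    # A 10-token document with chunk size 4: the last chunk is shorter.
    print(chunk_position_ids(10, chunk_size=4))
    # [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1]]
```

With the suggested chunk size of 1024, no chunk sees a position index above 1023, regardless of how long the full document is.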
Topics
- Long-Document Retrieval
- Dense Retrieval
- Chunk Evidence Aggregation
- Evidence Dilution Index
- DICE Algorithm
- LLM Embedders
- LongEmbed Benchmark
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.