Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory
Summary
The "Entity-Collision" protocol is a system-agnostic method for attributing retrieval lift in agent memory benchmarks. It targets lexical leakage and tag-mixing, both of which confound traditional hit@k metrics. The protocol pins the BM25 floor by ensuring that every distractor shares the answer's entity tokens, and it stratifies queries by discriminator tag, so that any lift over BM25 can be attributed directly to the embedder. Applied to an open-source testbed, the protocol shows that a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision, while MiniLM-384 generally dominates. Notably, the 2.7x-parameter BGE-large does not uniformly outperform MiniLM: it excels on intent-style queries but underperforms on lexical ones, which indicates that encoder capacity is not the sole constraint. Reproducibility rests on version-controlled scripts and a deterministically governed memory testbed.
Key takeaway
Machine learning engineers evaluating or designing agent memory retrieval systems should adopt stratified evaluation protocols such as Entity-Collision. Controlling lexical leakage and categorizing query types makes it possible to attribute performance gains to specific embedders. Embedder choice should match query characteristics: MiniLM-384 offers broad utility, while the larger BGE-large may be the better choice for intent-style queries.
Key insights
The Entity-Collision protocol isolates embedder performance in agent memory retrieval by controlling lexical overlap and stratifying queries.
Principles
- Lexical leakage confounds retriever benchmarks, requiring controlled evaluation.
- Encoder capacity alone does not guarantee superior retrieval performance.
- Stratifying queries by tag reveals nuanced embedder strengths and weaknesses.
Method
The protocol pins the BM25 floor by ensuring all distractors share the answer's entity tokens and stratifies queries by discriminator tag, attributing any lift over BM25 solely to the embedder.
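The collision step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Memory` record, the `collision_pool` helper, the whitespace tokenizer, and the `depth` parameter are all hypothetical names chosen for illustration. The idea shown is only that every distractor in the candidate pool must contain the answer's entity tokens, so a lexical scorer cannot separate the answer on the entity alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    text: str
    tag: str  # hypothetical discriminator tag label


def tokens(text: str) -> set[str]:
    # Simplifying assumption: lowercase whitespace tokenization.
    return set(text.lower().split())


def collision_pool(answer: Memory, corpus: list[Memory],
                   entity: str, depth: int) -> list[Memory]:
    """Return the answer plus up to `depth` distractors sharing the entity tokens."""
    entity_tokens = tokens(entity)
    if not entity_tokens <= tokens(answer.text):
        raise ValueError("answer must contain the entity tokens")
    # Keep only memories that contain every entity token of the answer.
    distractors = [
        m for m in corpus
        if m != answer and entity_tokens <= tokens(m.text)
    ]
    return [answer] + distractors[:depth]


corpus = [
    Memory("alice prefers tea in the morning", "preference"),
    Memory("alice moved to berlin", "location"),
    Memory("bob prefers tea", "preference"),
    Memory("alice works at acme", "employer"),
]
pool = collision_pool(corpus[0], corpus, entity="alice", depth=2)
for m in pool:
    print(m.tag, "|", m.text)
```

In this toy corpus the "bob" memory is excluded because it lacks the entity token, so the remaining candidates differ only in their non-entity content.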
In practice
- Control lexical overlap in agent memory retrieval benchmarks.
- Evaluate embedders across diverse query types (e.g., lexical vs. intent).
- Consider MiniLM-384 for broad retrieval tasks, BGE-large for intent-style queries.
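Evaluating per query type comes down to computing hit@k separately for each tag and comparing the embedder against the BM25 baseline. The sketch below assumes a hypothetical record layout (`tag`, `answer_id`, `bm25_ranking`, `embedder_ranking`) and made-up toy rankings; it is not taken from the original article's scripts and the numbers are not results.

```python
from collections import defaultdict


def hit_at_k(ranked_ids: list[str], answer_id: str, k: int) -> float:
    # 1.0 if the answer appears in the top-k ranked ids, else 0.0.
    return 1.0 if answer_id in ranked_ids[:k] else 0.0


def stratified_lift(records: list[dict], k: int = 1) -> dict:
    """Per-tag hit@k for BM25 and the embedder, plus lift (embedder minus BM25)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # bm25 hits, embedder hits, count
    for r in records:
        s = sums[r["tag"]]
        s[0] += hit_at_k(r["bm25_ranking"], r["answer_id"], k)
        s[1] += hit_at_k(r["embedder_ranking"], r["answer_id"], k)
        s[2] += 1
    return {
        tag: {"bm25": b / n, "embedder": e / n, "lift": (e - b) / n}
        for tag, (b, e, n) in sums.items()
    }


# Toy records for illustration only.
records = [
    {"tag": "lexical", "answer_id": "a", "bm25_ranking": ["a", "b"], "embedder_ranking": ["a", "b"]},
    {"tag": "lexical", "answer_id": "c", "bm25_ranking": ["d", "c"], "embedder_ranking": ["d", "c"]},
    {"tag": "intent", "answer_id": "e", "bm25_ranking": ["f", "e"], "embedder_ranking": ["e", "f"]},
    {"tag": "intent", "answer_id": "g", "bm25_ranking": ["h", "g"], "embedder_ranking": ["g", "h"]},
]
report = stratified_lift(records, k=1)
for tag, row in report.items():
    print(tag, row)
```

Reporting lift per tag rather than one pooled number keeps a gain on one query type from masking a loss on another.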
Topics
- Agent Memory
- Information Retrieval
- Retrieval Benchmarks
- Embedder Evaluation
- Lexical Leakage
- BM25
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.