Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The "Entity-Collision" protocol is a system-agnostic method for attributing retrieval lift in agent memory benchmarks. It targets two confounds in traditional hit@k metrics: lexical leakage and tag-mixing. The protocol pins the BM25 floor by ensuring that every distractor shares the answer's entity tokens, and it stratifies queries by discriminator tag, so any lift over BM25 can be attributed directly to the embedder. Applied to an open-source testbed, the protocol shows that a 256-d hash-trigram embedder helps only on closed-vocabulary lexical tags at deep collision, while MiniLM-384 generally dominates. Notably, BGE-large, with 2.7x the parameters, does not uniformly outperform MiniLM: it excels on intent-style queries but underperforms on lexical ones, which indicates that encoder capacity is not the sole constraint. Reproducibility rests on version-controlled scripts and a deterministically governed memory testbed.
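For readers unfamiliar with the "256-d hash trigram" baseline mentioned above, the following is a minimal sketch of how such an embedder is commonly built: character trigrams are hashed into 256 buckets and the resulting count vector is normalised. The function names, the padding scheme, and the choice of BLAKE2b as the hash are assumptions made for illustration; the summary does not describe the exact construction used in the original work.

```python
import hashlib
import math

def hash_trigram_embed(text, dim=256):
    """Hash character trigrams into a fixed-size, L2-normalised count vector.

    Illustrative only: padding and hash function are assumptions.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()} "  # pad so boundary trigrams are captured
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # A stable hash keeps results reproducible across runs
        # (Python's built-in hash() is salted per process for strings).
        digest = hashlib.blake2b(trigram.encode("utf-8"), digest_size=8).digest()
        vec[int.from_bytes(digest, "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity for vectors that are already unit-length."""
    return sum(x * y for x, y in zip(a, b))

# Strings that share surface form land close together; unrelated ones do not.
anchor = hash_trigram_embed("deploy_prod")
near = cosine(anchor, hash_trigram_embed("deploy_production"))
far = cosine(anchor, hash_trigram_embed("weather today"))
```

An embedder of this kind only sees surface character overlap, which is consistent with the reported finding that it helps on closed-vocabulary lexical tags rather than on intent-style queries.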

Key takeaway

Machine learning engineers who evaluate or design agent memory retrieval systems should adopt stratified evaluation protocols such as Entity-Collision. Controlling lexical leakage and grouping queries by type makes it possible to attribute performance gains to a specific embedder. Match the embedder to the query characteristics: MiniLM-384 offers broad utility, while BGE-large, despite its larger size, is the better fit mainly for intent-style queries.

Key insights

The Entity-Collision protocol isolates embedder performance in agent memory retrieval by controlling lexical overlap and stratifying queries by tag.

Principles

Lexical leakage confounds retriever benchmarks, requiring controlled evaluation.
Encoder capacity alone does not guarantee superior retrieval performance.
Stratifying queries by tag reveals nuanced embedder strengths and weaknesses.

Method

The protocol pins the BM25 floor by ensuring all distractors share the answer's entity tokens and stratifies queries by discriminator tag, attributing any lift over BM25 solely to the embedder.
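The idea of pinning the BM25 floor can be illustrated with a toy collision pool. The sketch below uses a plain Okapi BM25 written from scratch, a made-up entity ("alice chen"), and invented memory strings; none of these come from the original testbed. It shows that when every candidate shares the entity tokens and the query's discriminating words do not appear verbatim in any candidate, BM25 scores tie and lexical matching falls back to chance.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Plain Okapi BM25 over a small in-memory pool (illustrative only)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def all_share_entity(entity, docs):
    """Check the collision condition: every candidate contains the entity tokens."""
    entity_tokens = set(tokenize(entity))
    return all(entity_tokens <= set(tokenize(d)) for d in docs)

# Hypothetical collision pool: the answer plus three same-entity distractors.
pool = [
    "alice chen drinks tea",       # answer
    "alice chen lives abroad",     # distractor
    "alice chen joined recently",  # distractor
    "alice chen owns bikes",       # distractor
]
query = "alice chen favorite beverage"

scores = bm25_scores(query, pool)
bm25_is_tied = len({round(s, 9) for s in scores}) == 1
chance_hit_at_1 = 1 / len(pool)
```

In this toy pool the candidates also have equal length, so the tie is exact. With such a floor in place, an embedder that ranks the answer first is doing so on something other than entity-token overlap, which is the attribution the protocol aims for.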

In practice

Control lexical overlap in agent memory retrieval benchmarks.
Evaluate embedders across diverse query types (e.g., lexical vs. intent).
Consider MiniLM-384 for broad retrieval tasks, BGE-large for intent-style queries.
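The practices above can be sketched as a per-tag report of hit@k and lift over BM25. The record layout (`tag`, `bm25_rank`, `emb_rank`), the tag names, and the rank values below are invented for illustration and are not results from the original work; the point is only that pooling across tags can hide differences that stratification exposes.

```python
from collections import defaultdict

def hit_at_k(ranks, k):
    """Fraction of queries whose answer is ranked at position k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def stratified_lift(records, k=1):
    """Report BM25 hit@k, embedder hit@k, and their difference per tag."""
    by_tag = defaultdict(lambda: {"bm25": [], "emb": []})
    for rec in records:
        by_tag[rec["tag"]]["bm25"].append(rec["bm25_rank"])
        by_tag[rec["tag"]]["emb"].append(rec["emb_rank"])
    report = {}
    for tag, ranks in by_tag.items():
        base = hit_at_k(ranks["bm25"], k)
        emb = hit_at_k(ranks["emb"], k)
        report[tag] = {"bm25": base, "embedder": emb, "lift": emb - base}
    return report

# Toy ranks (1 = answer ranked first); purely hypothetical numbers.
records = (
    [{"tag": "lexical", "bm25_rank": b, "emb_rank": e}
     for b, e in [(1, 1), (3, 1), (2, 2), (4, 3)]]
    + [{"tag": "intent", "bm25_rank": b, "emb_rank": e}
       for b, e in [(2, 1), (4, 1), (3, 1), (1, 2)]]
)

per_tag = stratified_lift(records, k=1)

# Pooled view for comparison: one number that averages over both tags.
pooled_lift = (
    hit_at_k([r["emb_rank"] for r in records], 1)
    - hit_at_k([r["bm25_rank"] for r in records], 1)
)
```

Here the pooled lift sits between the two per-tag lifts, so a single aggregate would understate the gain on one query type and overstate it on the other.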

Topics

Agent Memory
Information Retrieval
Retrieval Benchmarks
Embedder Evaluation
Lexical Leakage
BM25

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.