Filtered ANN as a Phase Transition: When Selectivity-Estimation Error Causes Plan Regret
Summary
A filtered approximate-nearest-neighbor (ANN) query returns the k nearest vectors among those satisfying an attribute predicate P of selectivity s. The optimal execution strategy—pre-filter, post-filter, or in-filter—changes with s, a phenomenon modeled as an argmax over a landscape with phase transitions. Selectivity-estimation error causes plan regret, specifically recall loss versus an oracle strategy, concentrated in critical regions around these boundaries. This regret forms a wedge with log-width equal to the multiplicative estimation error epsilon and height proportional to the local cliff |V'(s*)| epsilon. The post-filter cliff occurs at s ~ k/K, while the in-filter cliff is at s_c ~ 0.83/M for graph degree M. Experiments on synthetic sweeps and SIFT1M confirm regret concentrates ~290x at the boundary, with curves obeying a universal finite-size scaling collapse. While real approximate indexes correctly locate boundaries, biased cost models introduce persistent miscalibration that estimation-error robustness cannot fix. This work provides a characterization, not a new index.
Key takeaway
For Machine Learning Engineers optimizing filtered ANN queries, understanding selectivity-estimation error's impact is critical. Your systems will experience ~290x higher recall loss near strategy phase boundaries (e.g., s ~ k/K, s_c ~ 0.83/M). Prioritize improving selectivity estimation accuracy in these critical regions and rigorously validate cost models to prevent persistent miscalibration, which estimation-error robustness cannot fix.
Key insights
Selectivity-estimation error causes significant plan regret in filtered ANN queries, especially near strategy phase boundaries.
Principles
- Optimal ANN query strategy shifts with selectivity.
- Plan regret concentrates at strategy phase boundaries.
- Biased cost models cause persistent miscalibration.
Method
The paper models filtered ANN strategy choice as an argmax over a landscape with phases, identifying boundaries where optimal strategies change.
In practice
- Focus error reduction near s ~ k/K and s_c ~ 0.83/M.
- Validate cost models to avoid persistent miscalibration.
Topics
- Filtered ANN
- Selectivity Estimation
- Query Optimization
- Plan Regret
- Phase Transitions
- Approximate Nearest Neighbor
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.