Filtered ANN as a Phase Transition: When Selectivity-Estimation Error Causes Plan Regret

2026-06-15 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A filtered approximate-nearest-neighbor (ANN) query returns the k nearest vectors among those satisfying an attribute predicate P of selectivity s. The optimal execution strategy—pre-filter, post-filter, or in-filter—changes with s, a phenomenon modeled as an argmax over a landscape with phase transitions. Selectivity-estimation error causes plan regret, specifically recall loss versus an oracle strategy, concentrated in critical regions around these boundaries. This regret forms a wedge with log-width equal to the multiplicative estimation error epsilon and height proportional to the local cliff |V'(s*)| epsilon. The post-filter cliff occurs at s ~ k/K, while the in-filter cliff is at s_c ~ 0.83/M for graph degree M. Experiments on synthetic sweeps and SIFT1M confirm regret concentrates ~290x at the boundary, with curves obeying a universal finite-size scaling collapse. While real approximate indexes correctly locate boundaries, biased cost models introduce persistent miscalibration that estimation-error robustness cannot fix. This work provides a characterization, not a new index.

Key takeaway

For Machine Learning Engineers optimizing filtered ANN queries, understanding selectivity-estimation error's impact is critical. Your systems will experience ~290x higher recall loss near strategy phase boundaries (e.g., s ~ k/K, s_c ~ 0.83/M). Prioritize improving selectivity estimation accuracy in these critical regions and rigorously validate cost models to prevent persistent miscalibration, which estimation-error robustness cannot fix.

Key insights

Selectivity-estimation error causes significant plan regret in filtered ANN queries, especially near strategy phase boundaries.

Principles

Optimal ANN query strategy shifts with selectivity.
Plan regret concentrates at strategy phase boundaries.
Biased cost models cause persistent miscalibration.

Method

The paper models filtered ANN strategy choice as an argmax over a landscape with phases, identifying boundaries where optimal strategies change.

In practice

Focus error reduction near s ~ k/K and s_c ~ 0.83/M.
Validate cost models to avoid persistent miscalibration.

Topics

Filtered ANN
Selectivity Estimation
Query Optimization
Plan Regret
Phase Transitions
Approximate Nearest Neighbor

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.