Filtered ANN as a Phase Transition: When Selectivity-Estimation Error Causes Plan Regret

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A filtered approximate-nearest-neighbor (ANN) query returns the k nearest vectors among those satisfying an attribute predicate P of selectivity s. The optimal execution strategy—pre-filter, post-filter, or in-filter—changes with s, a phenomenon modeled as an argmax over a landscape with phase transitions. Selectivity-estimation error causes plan regret, specifically recall loss versus an oracle strategy, concentrated in critical regions around these boundaries. This regret forms a wedge with log-width equal to the multiplicative estimation error epsilon and height proportional to the local cliff |V'(s*)| epsilon. The post-filter cliff occurs at s ~ k/K, while the in-filter cliff is at s_c ~ 0.83/M for graph degree M. Experiments on synthetic sweeps and SIFT1M confirm regret concentrates ~290x at the boundary, with curves obeying a universal finite-size scaling collapse. While real approximate indexes correctly locate boundaries, biased cost models introduce persistent miscalibration that estimation-error robustness cannot fix. This work provides a characterization, not a new index.

Key takeaway

For Machine Learning Engineers optimizing filtered ANN queries, understanding selectivity-estimation error's impact is critical. Your systems will experience ~290x higher recall loss near strategy phase boundaries (e.g., s ~ k/K, s_c ~ 0.83/M). Prioritize improving selectivity estimation accuracy in these critical regions and rigorously validate cost models to prevent persistent miscalibration, which estimation-error robustness cannot fix.

Key insights

Selectivity-estimation error causes significant plan regret in filtered ANN queries, especially near strategy phase boundaries.

Principles

Method

The paper models filtered ANN strategy choice as an argmax over a landscape with phases, identifying boundaries where optimal strategies change.

In practice

Topics

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.