Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts
Summary
The paper introduces a novel approach to embedding model routing in modern recommendation systems, addressing challenges like adversarial queries, bandit feedback, and limited model observability. It formalizes this as an adversarial contextual linear bandit problem with low-rank experts, where queries are contexts and embedding models are experts operating in low-rank latent spaces. A "log-quadratic" policy class, Π₇₇₇₇, is established as both expressive for query-dependent routing and amenable to efficient online learning. The proposed Hypentropy Policy Gradient (HPG) algorithm provably adapts to unknown low-rank structures, achieving Õ(s√MT) linearized policy regret, thereby avoiding a curse of dimensionality. HPG is also shown to be computationally efficient, with an Õ(d₇²M) per-round runtime, and can be implemented parameter-free. Numerical experiments on the Amazon ESCI dataset demonstrate that Π₇₇₇₇ significantly outperforms log-linear policies, yielding a 68.0% improvement in reducing the sub-optimality gap.
Key takeaway
For ML Engineers and AI Scientists developing recommendation systems or Retrieval-Augmented Generation (RAG) pipelines, you should move beyond simple log-linear routing policies. Implementing log-quadratic policies, particularly with the Hypentropy Policy Gradient (HPG) algorithm, allows you to effectively capitalize on the intrinsic low-rank structure of embedding models. This approach significantly improves routing accuracy and computational efficiency, especially in high-dimensional query spaces (d₇=768), by avoiding the "curse of dimensionality" and outperforming traditional methods.
Key insights
Optimal embedding model routing policies are approximated by low-rank quadratic functions, enabling efficient learning with HPG.
Principles
- No single embedding model is uniformly optimal across all query types.
- Expected model rewards exhibit a low-rank quadratic structure in queries.
- Hypentropy regularization promotes low-rankness in matrix parameter spaces.
Method
Hypentropy Policy Gradient (HPG) uses a hypentropy-regularized Online Mirror Descent with a REINFORCE-style gradient estimator to adaptively learn low-rank quadratic routing policies under bandit feedback.
In practice
- Dynamically route queries to specialized embedding models.
- Employ low-rank matrix parameterization for high-dimensional query spaces.
- Utilize HPG for efficient, parameter-free online learning of routing policies.
Topics
- Embedding Model Routing
- Contextual Bandits
- Low-Rank Matrix Learning
- Policy Gradient Algorithms
- Recommendation Systems
- Online Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.