Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

2026-06-16 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper introduces a novel approach to embedding model routing in modern recommendation systems, addressing challenges like adversarial queries, bandit feedback, and limited model observability. It formalizes this as an adversarial contextual linear bandit problem with low-rank experts, where queries are contexts and embedding models are experts operating in low-rank latent spaces. A "log-quadratic" policy class, Π₇₇₇₇, is established as both expressive for query-dependent routing and amenable to efficient online learning. The proposed Hypentropy Policy Gradient (HPG) algorithm provably adapts to unknown low-rank structures, achieving Õ(s√MT) linearized policy regret, thereby avoiding a curse of dimensionality. HPG is also shown to be computationally efficient, with an Õ(d₇²M) per-round runtime, and can be implemented parameter-free. Numerical experiments on the Amazon ESCI dataset demonstrate that Π₇₇₇₇ significantly outperforms log-linear policies, yielding a 68.0% improvement in reducing the sub-optimality gap.

Key takeaway

For ML Engineers and AI Scientists developing recommendation systems or Retrieval-Augmented Generation (RAG) pipelines, you should move beyond simple log-linear routing policies. Implementing log-quadratic policies, particularly with the Hypentropy Policy Gradient (HPG) algorithm, allows you to effectively capitalize on the intrinsic low-rank structure of embedding models. This approach significantly improves routing accuracy and computational efficiency, especially in high-dimensional query spaces (d₇=768), by avoiding the "curse of dimensionality" and outperforming traditional methods.

Key insights

Optimal embedding model routing policies are approximated by low-rank quadratic functions, enabling efficient learning with HPG.

Principles

No single embedding model is uniformly optimal across all query types.
Expected model rewards exhibit a low-rank quadratic structure in queries.
Hypentropy regularization promotes low-rankness in matrix parameter spaces.

Method

Hypentropy Policy Gradient (HPG) uses a hypentropy-regularized Online Mirror Descent with a REINFORCE-style gradient estimator to adaptively learn low-rank quadratic routing policies under bandit feedback.

In practice

Dynamically route queries to specialized embedding models.
Employ low-rank matrix parameterization for high-dimensional query spaces.
Utilize HPG for efficient, parameter-free online learning of routing policies.

Topics

Embedding Model Routing
Contextual Bandits
Low-Rank Matrix Learning
Policy Gradient Algorithms
Recommendation Systems
Online Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.