Online Learning and Equilibrium Computation with Ranking Feedback

2026-03-19 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

This research introduces an online learning model where a learner receives only ranking feedback over proposed actions at each timestep, departing from traditional numeric utility feedback. The study examines two ranking mechanisms: instantaneous utility and time-average utility, across both full-information and bandit feedback settings. It demonstrates that sublinear regret is generally impossible with instantaneous-utility ranking feedback. Furthermore, under the Plackett-Luce model with a sufficiently small temperature, sublinear regret is also impossible with time-average utility ranking feedback. The authors develop new algorithms that achieve sublinear regret when the utility sequence exhibits sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption is not required. The paper concludes by showing that these algorithms lead to an approximate coarse correlated equilibrium in repeated normal-form games and validates their effectiveness in an online large-language-model routing task.

Key takeaway

For AI Researchers developing online learning systems with human-in-the-loop components or privacy constraints, you should consider implementing time-average utility ranking feedback mechanisms. This approach, especially with full-information, can enable sublinear regret and facilitate convergence to approximate coarse correlated equilibria in multi-agent systems, offering a viable alternative to numeric utility feedback.

Key insights

Online learning with ranking feedback can achieve sublinear regret under specific utility and feedback conditions.

Principles

Sublinear regret is generally impossible with instantaneous-utility ranking.
Time-average utility ranking can enable sublinear regret.

Method

New algorithms achieve sublinear regret by assuming sublinear total variation in utility sequences, or without it for full-information time-average feedback.

In practice

Apply to online large-language-model routing tasks.
Use in game theory for approximate coarse correlated equilibrium.

Topics

Online Learning
Ranking Feedback
Regret Minimization
Game Theory
Large Language Models

Best for: AI Researcher, AI Scientist, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.