Online Learning and Equilibrium Computation with Ranking Feedback
Summary
This research introduces an online learning model where a learner receives only ranking feedback over proposed actions at each timestep, departing from traditional numeric utility feedback. The study examines two ranking mechanisms: instantaneous utility and time-average utility, across both full-information and bandit feedback settings. It demonstrates that sublinear regret is generally impossible with instantaneous-utility ranking feedback. Furthermore, under the Plackett-Luce model with a sufficiently small temperature, sublinear regret is also impossible with time-average utility ranking feedback. The authors develop new algorithms that achieve sublinear regret when the utility sequence exhibits sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption is not required. The paper concludes by showing that these algorithms lead to an approximate coarse correlated equilibrium in repeated normal-form games and validates their effectiveness in an online large-language-model routing task.
Key takeaway
For AI Researchers developing online learning systems with human-in-the-loop components or privacy constraints, you should consider implementing time-average utility ranking feedback mechanisms. This approach, especially with full-information, can enable sublinear regret and facilitate convergence to approximate coarse correlated equilibria in multi-agent systems, offering a viable alternative to numeric utility feedback.
Key insights
Online learning with ranking feedback can achieve sublinear regret under specific utility and feedback conditions.
Principles
- Sublinear regret is generally impossible with instantaneous-utility ranking.
- Time-average utility ranking can enable sublinear regret.
Method
New algorithms achieve sublinear regret by assuming sublinear total variation in utility sequences, or without it for full-information time-average feedback.
In practice
- Apply to online large-language-model routing tasks.
- Use in game theory for approximate coarse correlated equilibrium.
Topics
- Online Learning
- Ranking Feedback
- Regret Minimization
- Game Theory
- Large Language Models
Best for: AI Researcher, AI Scientist, Research Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.