Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries
Summary
MO-PQUCB is a novel hybrid algorithm designed for personalized multi-objective bandits, addressing the challenge of learning user-specific trade-offs among competing objectives. Existing methods infer user preferences solely from utility feedback, which conflates preference learning with reward exploration. The framework formalizes proactive conversational queries (e.g., "cheap and clean hotel") as structured preference signals, which typically go unused. By modeling these signals with a Plackett-Luce subset choice model, the research identifies a fundamental shift-invariance barrier, which means that learning from queries alone is insufficient. MO-PQUCB resolves this by integrating query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. The algorithm demonstrates provably accelerated preference estimation and improved regret scaling over previous preference-aware multi-objective multi-armed bandit (MO-MAB) methods. Furthermore, the work characterizes statistical limits under corrupted queries and provides a robust estimator for sparse corruption, validated by experiments.
Key takeaway
For Machine Learning Engineers developing personalized recommendation or decision systems, integrating proactive user queries is crucial. Consider adopting a hybrid approach like MO-PQUCB to explicitly anchor user preferences, rather than relying solely on implicit utility feedback. This method offers provably accelerated preference estimation and improved regret scaling, leading to more efficient and accurate personalized experiences, and it includes a robust estimator for cases where a sparse subset of queries is corrupted.
Key insights
Proactive user queries in multi-objective bandits significantly improve preference learning and regret scaling by decoupling preference learning from reward exploration.
Principles
- Proactive queries offer structured preference signals.
- Query-only preference learning faces a shift-invariance barrier.
- Hybrid approach improves regret scaling.
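The shift-invariance barrier can be illustrated with a small Plackett-Luce example. The sketch below assumes a sequential choice-without-replacement form of the model over per-objective log-weights; the function name, the score values, and the example query are hypothetical and are not taken from the original article. It shows that adding the same constant to every score leaves the query likelihood unchanged, so queries alone cannot pin down the absolute level of the preference vector.

```python
import math

def pl_subset_log_prob(scores, chosen):
    """Log-probability of picking the items in `chosen`, in order, under a
    Plackett-Luce model (sequential choice without replacement)."""
    remaining = list(range(len(scores)))
    log_p = 0.0
    for item in chosen:
        # Numerically stable log-sum-exp over the items still available.
        m = max(scores[j] for j in remaining)
        log_norm = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        log_p += scores[item] - log_norm
        remaining.remove(item)
    return log_p

# Hypothetical log-weights for four objectives (illustrative values only).
scores = [1.2, 0.3, -0.5, 0.0]
# Hypothetical query: the user mentions objective 0, then objective 2.
query = [0, 2]

# Shift every score by the same constant.
shifted = [s + 5.0 for s in scores]

lp = pl_subset_log_prob(scores, query)
lp_shifted = pl_subset_log_prob(shifted, query)
# The two log-probabilities agree up to floating-point rounding.
```

Because the likelihood is identical for `scores` and `shifted`, any estimator built on queries alone recovers the scores only up to an additive constant. This is the gap that, according to the summary, the hybrid approach closes by combining queries with bandit feedback.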
Method
MO-PQUCB integrates query-based preference anchoring with bandit feedback using shift-invariant regularization and dual-exploration UCB. It models queries via a Plackett-Luce subset choice model and includes a robust estimator for sparse query corruption.
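To give a rough sense of what exploring along two axes can look like, here is a generic sketch of a scalarized UCB index with two separate bonus terms: one for uncertainty in the reward estimates and one for uncertainty in the preference estimate. This is not the MO-PQUCB index, whose exact form the summary does not give. The function names, the bonus formulas, and the constants `c_reward` and `c_pref` are all assumptions made for illustration.

```python
import math

def dual_bonus_index(w_hat, mu_hat, n_pulls, n_pref_obs, t,
                     c_reward=1.0, c_pref=1.0):
    """Hypothetical optimistic index for one arm.

    w_hat      : estimated preference weights over objectives
    mu_hat     : estimated mean reward vector of the arm
    n_pulls    : number of times the arm has been pulled
    n_pref_obs : number of preference observations (e.g. queries) so far
    t          : current round
    """
    log_t = math.log(max(t, 2))
    # Estimated scalarized utility of the arm.
    utility = sum(w * m for w, m in zip(w_hat, mu_hat))
    # Bonus for reward uncertainty: shrinks as the arm is pulled more often.
    reward_bonus = (c_reward * sum(abs(w) for w in w_hat)
                    * math.sqrt(log_t / max(n_pulls, 1)))
    # Bonus for preference uncertainty: shrinks as preference data accumulates.
    pref_bonus = (c_pref * math.sqrt(sum(m * m for m in mu_hat))
                  * math.sqrt(log_t / max(n_pref_obs, 1)))
    return utility + reward_bonus + pref_bonus

def select_arm(w_hat, mu_hats, pulls, n_pref_obs, t):
    """Pick the arm with the largest hypothetical index."""
    indices = [dual_bonus_index(w_hat, mu, n, n_pref_obs, t)
               for mu, n in zip(mu_hats, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)

# Illustrative inputs: two arms with identical estimates, unequal pull counts.
w_hat = [0.6, 0.4]
mu_hats = [[0.5, 0.5], [0.5, 0.5]]
pulls = [10, 2]
chosen = select_arm(w_hat, mu_hats, pulls, n_pref_obs=5, t=12)
```

In this toy setup the less-pulled arm is selected because its reward bonus is larger, and the preference bonus shrinks as more preference observations arrive. That is one intuitive reading of why extra query signals could reduce the exploration a learner needs.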
In practice
- Incorporate user queries for better personalization.
- Design systems to capture explicit preference signals.
- Use MO-PQUCB for improved MO-MAB performance.
Topics
- Multi-Objective Bandits
- Personalized Decision Making
- Proactive Conversational Queries
- Preference Learning
- MO-PQUCB Algorithm
- Regret Scaling
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.