Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

2026-06-07 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MO-PQUCB is a hybrid algorithm for personalized multi-objective bandits, where the learner must discover each user's trade-offs among competing objectives. Existing methods infer user preferences solely from utility feedback, which conflates preference learning with reward exploration. The paper formalizes proactive conversational queries (e.g., "cheap and clean hotel") as structured preference signals, a source of information that such methods typically leave unused. Modeling these signals with a Plackett-Luce subset choice model, the authors identify a fundamental shift-invariance barrier, which implies that learning from queries alone is insufficient. MO-PQUCB resolves this by combining query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. The paper reports provably accelerated preference estimation and improved regret scaling over previous preference-aware multi-objective multi-armed bandit (MO-MAB) methods. It also characterizes statistical limits under corrupted queries and provides a robust estimator for sparse corruption, with experiments supporting the analysis.

Key takeaway

Machine learning engineers building personalized recommendation or decision systems should treat proactive user queries as a first-class preference signal. A hybrid approach like MO-PQUCB anchors user preferences explicitly instead of relying solely on implicit utility feedback. The paper shows provably accelerated preference estimation and improved regret scaling, and its robust estimator covers the case where a sparse subset of queries is corrupted.

Key insights

Proactive user queries in multi-objective bandits improve preference estimation and regret scaling by decoupling preference learning from reward exploration.

Principles

Proactive queries offer structured preference signals.
Query-only preference learning faces a shift-invariance barrier.
Combining queries with bandit feedback improves regret scaling.
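
The shift-invariance barrier can be illustrated with a small sketch. It assumes a standard Plackett-Luce model in which items are picked one at a time with probability proportional to the exponential of their score; the paper's exact subset choice model may differ. The objective names and score values are hypothetical. Adding the same constant to every score leaves the choice probability unchanged, so query data alone cannot pin down the absolute level of the scores.

```python
import math

def pl_choice_probs(scores):
    """Probability of picking each item first under a Plackett-Luce model."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def pl_subset_prob(scores, chosen):
    """Probability of picking the items in `chosen`, in order, without replacement."""
    remaining = list(range(len(scores)))
    prob = 1.0
    for idx in chosen:
        probs = pl_choice_probs([scores[i] for i in remaining])
        prob *= probs[remaining.index(idx)]
        remaining.remove(idx)
    return prob

# Hypothetical preference scores over three objectives:
# index 0 = price, 1 = cleanliness, 2 = location.
scores = [1.2, 0.7, -0.4]
shifted = [s + 5.0 for s in scores]  # same scores plus a constant

# A query such as "cheap and clean" names objectives 0 and 1.
p_original = pl_subset_prob(scores, [0, 1])
p_shifted = pl_subset_prob(shifted, [0, 1])

# The two probabilities agree up to floating-point error.
print(abs(p_original - p_shifted) < 1e-9)
```

Because both score vectors explain the observed queries equally well, some additional signal is needed to fix the scale, which is the role the summary attributes to bandit feedback and shift-invariant regularization.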

Method

MO-PQUCB integrates query-based preference anchoring with bandit feedback using shift-invariant regularization and dual-exploration UCB. It models queries via a Plackett-Luce subset choice model and includes a robust estimator for sparse query corruption.
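
To make the setting concrete, here is a minimal sketch of a much simpler baseline: standard UCB1 applied to vector rewards that are scalarized by a fixed preference vector. It is not MO-PQUCB. The preference weights are assumed known, so there is no preference estimation, no query model, and no second exploration term. The function name `scalarized_ucb`, the arm means, and the noise level are all hypothetical.

```python
import math
import random

def scalarized_ucb(true_means, weights, horizon, seed=0):
    """Run UCB1 on rewards scalarized by a fixed, known preference vector.

    true_means[a][d]: hypothetical mean reward of arm a on objective d.
    weights[d]: preference weight for objective d (assumed known here).
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    n_obj = len(weights)
    counts = [0] * n_arms
    mean_est = [[0.0] * n_obj for _ in range(n_arms)]

    def ucb_index(a, t):
        # Estimated utility under the preference vector plus the UCB1 bonus.
        utility = sum(w * m for w, m in zip(weights, mean_est[a]))
        return utility + math.sqrt(2.0 * math.log(t) / counts[a])

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once to initialise the estimates
        else:
            arm = max(range(n_arms), key=lambda a: ucb_index(a, t))

        # Noisy vector reward: one observation per objective.
        reward = [mu + rng.gauss(0.0, 0.1) for mu in true_means[arm]]
        counts[arm] += 1
        for d in range(n_obj):
            # Incremental update of the per-objective running mean.
            mean_est[arm][d] += (reward[d] - mean_est[arm][d]) / counts[arm]
    return counts

# Three hypothetical hotels scored on (price, cleanliness).
true_means = [[0.9, 0.2], [0.5, 0.5], [0.2, 0.9]]
counts_price = scalarized_ucb(true_means, weights=[0.9, 0.1], horizon=2000)
counts_clean = scalarized_ucb(true_means, weights=[0.1, 0.9], horizon=2000)
print(counts_price, counts_clean)
```

The same arms yield different pull counts under the two weight vectors, which shows why the preference vector matters. In the personalized setting the weights are unknown, and the summary describes MO-PQUCB as estimating them from both queries and bandit feedback.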

In practice

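Incorporate proactive user queries to speed up personalization.
Design systems to capture explicit preference signals, such as the objectives a user names in a request.
Consider MO-PQUCB where existing preference-aware MO-MAB methods rely on utility feedback alone.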

Topics

Multi-Objective Bandits
Personalized Decision Making
Proactive Conversational Queries
Preference Learning
MO-PQUCB Algorithm
Regret Scaling

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.