Calibration-Gated LLM Pseudo-Observations for Online Contextual Bandits

2026-04-16 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Calibration-Gated LLM Pseudo-Observations augments the Disjoint LinUCB contextual bandit with large language model (LLM) predictions of counterfactual rewards for unplayed arms. After each round, an LLM predicts the reward each unplayed arm would have returned, and these predictions are added to the learner as weighted pseudo-observations. The injection weight follows a calibration-gated decay schedule: an exponential moving average of the LLM's prediction accuracy on played arms suppresses the LLM's influence when calibration error is high and lets accurate predictions carry more weight, particularly in early rounds. The method was evaluated on UCI Mushroom and MIND-small (a 5-arm news recommendation task). On MIND, with a task-specific prompt, it reduced cumulative regret by 19% relative to pure LinUCB. Generic counterfactual prompt framing, by contrast, increased regret, which points to prompt design as the dominant factor.
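To make the gating idea concrete, here is a minimal sketch of one way such a schedule could look. The summary does not give the exact formula, so the class name CalibrationGate, the linear gate, and every parameter (base_weight, decay, ema_alpha, error_cap) are illustrative assumptions rather than the authors' definitions.

```python
from dataclasses import dataclass


@dataclass
class CalibrationGate:
    """Hypothetical gate: an EMA of LLM error on played arms scales a decaying weight."""

    base_weight: float = 0.5  # assumed starting weight for pseudo-observations
    decay: float = 0.99       # assumed per-round decay factor
    ema_alpha: float = 0.1    # assumed EMA smoothing factor
    error_cap: float = 0.5    # assumed EMA error at which the gate closes fully
    ema_error: float = 0.0
    rounds: int = 0

    def update(self, llm_prediction: float, observed_reward: float) -> None:
        """Fold the LLM's absolute error on the played arm into the EMA."""
        error = abs(llm_prediction - observed_reward)
        if self.rounds == 0:
            # Seed the EMA with the first observed error.
            self.ema_error = error
        else:
            self.ema_error = (
                self.ema_alpha * error + (1.0 - self.ema_alpha) * self.ema_error
            )
        self.rounds += 1

    def weight(self) -> float:
        """Weight to apply to pseudo-observations in the next learner update."""
        # Linear gate: 1 when the EMA error is 0, 0 once it reaches error_cap.
        gate = max(0.0, 1.0 - self.ema_error / self.error_cap)
        return self.base_weight * (self.decay ** self.rounds) * gate
```

Under these assumptions the weight is largest in early rounds with low error and falls to zero once the running error reaches error_cap, which matches the qualitative behavior described above without claiming to reproduce the published schedule.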

Key takeaway

Research scientists developing contextual bandit algorithms should consider LLM pseudo-observations as a way to reduce cold-start regret. Prioritize task-specific prompts, since generic prompts can degrade performance. Use a calibration-gated decay schedule to adjust the LLM's influence according to its running prediction accuracy, so that its predictions contribute during the early learning phase only when they are accurate.

Key insights

LLM pseudo-observations can reduce cold-start regret in contextual bandits, but prompt design is critical.

Principles

Calibration gating controls LLM influence.
Prompt design dominates LLM pseudo-observation efficacy.

Method

Augment Disjoint LinUCB with LLM-predicted counterfactual rewards for unplayed arms, weighting them via a calibration-gated decay schedule based on LLM accuracy on played arms.
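The augmentation can be sketched as a weighted update to each arm's ridge-regression statistics. This is a simplified illustration, not the authors' implementation: it assumes a single context vector shared by all arms in a round, and the names WeightedDisjointLinUCB, round_update, llm_predictions, and pseudo_weight are hypothetical.

```python
import numpy as np


class WeightedDisjointLinUCB:
    """Disjoint LinUCB whose per-arm ridge statistics accept weighted updates."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0, ridge: float = 1.0):
        self.alpha = alpha  # exploration strength
        # One design matrix A_a and one response vector b_a per arm.
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def select(self, x: np.ndarray) -> int:
        """Pick the arm with the highest upper confidence bound for context x."""
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float, weight: float = 1.0) -> None:
        """Weighted ridge update; weight=1.0 recovers the standard LinUCB update."""
        self.A[arm] += weight * np.outer(x, x)
        self.b[arm] += weight * reward * x


def round_update(bandit, x, played_arm, observed_reward, llm_predictions, pseudo_weight):
    """Apply the real observation at full weight, then LLM pseudo-observations."""
    bandit.update(played_arm, x, observed_reward)
    if pseudo_weight <= 0.0:
        return  # gate closed: the learner behaves like plain LinUCB this round
    for arm, predicted_reward in llm_predictions.items():
        if arm != played_arm:
            # Assumption: LLM predictions enter as down-weighted observations.
            bandit.update(arm, x, predicted_reward, weight=pseudo_weight)
```

In this sketch, pseudo_weight would come from whatever calibration-gated schedule is in use; passing 0.0 leaves the unplayed arms untouched.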

In practice

Use task-specific prompts for LLM pseudo-observations.
Track the LLM's prediction accuracy on played arms and use it to set the pseudo-observation weight.

Topics

Contextual Bandits
LLM Pseudo-Observations
Disjoint LinUCB
Calibration Gating
Prompt Design

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.