Quality Over Clicks: Iterative Reinforcement Learning for Early-Stage E-Commerce Query Suggestion

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Cold-EQS, an iterative reinforcement learning framework developed by Alibaba International Digital Commercial Group, addresses the cold-start problem in e-commerce query suggestion. Existing methods, which rely on large language models and Click-Through Rate (CTR) models, struggle without extensive online click data. Cold-EQS instead optimizes suggested queries against three intrinsic quality rewards: answerability, factual accuracy, and information gain. Training proceeds in phases: supervised fine-tuning of Qwen3-4B, reinforcement learning with quality-aware rewards scored by Qwen-30B-A3B, and uncertainty-aware sampling that selects challenging online queries lacking click signals. The framework reports a +6.81% improvement in online chatUV and, offline on its EQS-Benchmark dataset, 86.1% Strict Accuracy and 90.6% Valid Rate. Code, models, and the 16,949-query benchmark are publicly available.

Key takeaway

Machine learning engineers building e-commerce conversational AI should consider intrinsic quality rewards and uncertainty-aware sampling to overcome cold-start challenges in query suggestion. The approach, which gave Cold-EQS a +6.81% improvement in online chatUV, lets a model be optimized even when click data is sparse. The publicly available EQS-Benchmark and Cold-EQS code offer a starting point for producing answerable, factual query suggestions from the outset.

Key insights

Cold-EQS uses intrinsic quality rewards and uncertainty sampling to enable effective e-commerce query suggestion in cold-start scenarios.

Principles

Intrinsic quality metrics improve cold-start query suggestion.
Uncertainty sampling targets ambiguous online data.
Iterative RL refines models without dense click data.

Method

Cold-EQS first fine-tunes a base LLM (Qwen3-4B) on clicked data, then iteratively optimizes it with RL, using answerability, factuality, and information gain as rewards and sampling uncertain online queries as training input.
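
As a rough illustration of how three quality signals could become a single RL reward, the sketch below averages them with fixed weights. The summary does not say how Cold-EQS combines its rewards, so the weighted average, the weight values, and the names QualityScores and quality_reward are all assumptions made for this example. The scores are taken to be judge-model outputs already scaled to [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityScores:
    """Judge scores in [0, 1] for one suggested query (hypothetical schema)."""
    answerability: float
    factuality: float
    information_gain: float


def quality_reward(scores, weights=(0.4, 0.4, 0.2)):
    """Collapse the three quality signals into one scalar reward in [0, 1].

    The weights are illustrative assumptions, not values from the paper.
    """
    values = (scores.answerability, scores.factuality, scores.information_gain)
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    # Weighted average, normalized so the result stays in [0, 1].
    weighted = sum(w * v for w, v in zip(weights, values))
    return weighted / sum(weights)


if __name__ == "__main__":
    # A suggestion that is answerable and mostly factual but adds little new information.
    print(quality_reward(QualityScores(1.0, 0.8, 0.2)))
```

A weighted average keeps the reward bounded and makes the trade-off between the three signals explicit. A stricter design, for example zeroing the reward whenever factuality falls below a threshold, would be an equally plausible choice.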

In practice

Implement quality-driven rewards for LLM fine-tuning.
Use uncertainty sampling to prioritize hard examples.
Leverage EQS-Benchmark for offline evaluation.
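
The second point in the list above can be sketched as follows. The summary does not describe how Cold-EQS measures uncertainty, so this example uses a stand-in: for each online query, several suggestions are sampled from the current model and scored, and the spread (population standard deviation) of those rewards serves as the uncertainty signal. The function name select_uncertain_queries and the input layout are hypothetical.

```python
import statistics


def select_uncertain_queries(reward_samples, budget):
    """Return up to `budget` queries whose sampled rewards disagree the most.

    reward_samples maps a query string to the rewards of several suggestions
    sampled for that query (hypothetical layout). Queries with fewer than two
    samples are skipped because no spread can be estimated for them.
    """
    spread = {
        query: statistics.pstdev(rewards)
        for query, rewards in reward_samples.items()
        if len(rewards) >= 2
    }
    # Highest spread first; ties are broken alphabetically for determinism.
    ranked = sorted(spread, key=lambda q: (-spread[q], q))
    return ranked[:budget]


if __name__ == "__main__":
    samples = {
        "usb-c hub": [0.9, 0.9, 0.9],      # model is consistent: low priority
        "gift for dad": [0.1, 0.9, 0.5],   # rewards vary widely: high priority
        "red dress": [0.6, 0.7, 0.65],
    }
    print(select_uncertain_queries(samples, budget=2))
```

Queries selected this way would feed the next RL iteration, which concentrates training on inputs where the current model is least consistent instead of on inputs it already handles well.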

Topics

E-commerce AI
Query Suggestion
Reinforcement Learning
Cold-Start Problem
LLM Fine-tuning
Uncertainty Sampling
EQS-Benchmark

Code references

QiSun123/Cold-EQS

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.