Quality Over Clicks: Iterative Reinforcement Learning for Early-Stage E-Commerce Query Suggestion

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Cold-EQS, an iterative reinforcement learning framework developed by Alibaba International Digital Commerce Group, addresses the cold-start problem in e-commerce query suggestion. Existing methods pair large language models with Click-Through Rate (CTR) models and therefore struggle when little online click data is available. Cold-EQS instead optimizes suggested queries against three intrinsic quality rewards: answerability, factuality, and information gain. Training proceeds in phases: supervised fine-tuning of Qwen3-4B, reinforcement learning with quality-aware rewards scored by Qwen3-30B-A3B, and uncertainty-aware sampling that selects challenging online queries lacking click signals. The framework improved online chatUV by 6.81% and, offline, reached 86.1% Strict Accuracy and a 90.6% Valid Rate on its EQS-Benchmark dataset. The code, models, and the 16,949-query benchmark are publicly available.
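As a rough illustration of how three intrinsic quality signals might be folded into a single scalar reward for RL, the sketch below takes a weighted average of per-dimension judge scores. The summary does not say how Cold-EQS combines its rewards, so the `QualityScores` type, the `quality_reward` function, the [0, 1] score range, and the equal default weights are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityScores:
    """Hypothetical per-suggestion judge scores, each assumed to lie in [0, 1]."""
    answerability: float
    factuality: float
    information_gain: float


def quality_reward(scores: QualityScores,
                   weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the three quality scores into one scalar reward.

    Assumption: a normalized weighted average. The actual Cold-EQS
    aggregation rule is not described in the summary.
    """
    values = (scores.answerability, scores.factuality, scores.information_gain)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each quality score must lie in [0, 1]")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * v for w, v in zip(weights, values)) / total_weight


# Usage: a suggestion that is fully answerable, partly factual, and adds no new information.
reward = quality_reward(QualityScores(answerability=1.0, factuality=0.5, information_gain=0.0))
```

In a real pipeline the three scores would come from a judge model rather than being hard-coded, and the weights would need tuning against online metrics.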

Key takeaway

Machine learning engineers building e-commerce conversational AI should consider intrinsic quality rewards and uncertainty-aware sampling to overcome cold-start challenges in query suggestion. As Cold-EQS's +6.81% online chatUV improvement suggests, this approach lets a model be optimized even when click data is sparse. The publicly available EQS-Benchmark and Cold-EQS code offer a starting point for producing answerable, factual query suggestions from the outset.

Key insights

Cold-EQS uses intrinsic quality rewards and uncertainty sampling to enable effective e-commerce query suggestion in cold-start scenarios.

Principles

Method

Cold-EQS first fine-tunes a base LLM (Qwen3-4B) on clicked data, then iteratively optimizes it with RL, using answerability, factuality, and information gain as rewards and sampling uncertain online queries that lack click signals.
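The query-selection step of such an iterative loop could look roughly like the sketch below, which picks the online queries without click signals whose candidate suggestions receive the most widely spread rewards. The summary does not state how Cold-EQS measures uncertainty, so using the population standard deviation of rewards as the uncertainty score, along with the `select_uncertain_queries` name and its parameters, is purely an assumption.

```python
from statistics import pstdev


def select_uncertain_queries(reward_samples: dict[str, list[float]],
                             clicked_queries: set[str],
                             k: int) -> list[str]:
    """Return up to k queries for the next RL round.

    reward_samples maps each online query to the rewards of several
    suggestions sampled from the current policy (hypothetical input format).
    Queries that already have click signals are skipped, and the rest are
    ranked by reward spread, used here as an assumed proxy for uncertainty.
    """
    candidates = {
        query: rewards
        for query, rewards in reward_samples.items()
        if query not in clicked_queries and len(rewards) >= 2
    }
    ranked = sorted(candidates, key=lambda q: pstdev(candidates[q]), reverse=True)
    return ranked[:k]


# Usage with made-up reward samples for four online queries.
samples = {
    "wireless earbuds": [0.1, 0.9],   # high spread, no clicks
    "usb-c cable": [0.5, 0.5],        # no spread
    "running shoes": [0.0, 1.0],      # highest spread, but already has clicks
    "standing desk": [0.2, 0.8],      # moderate spread, no clicks
}
next_round = select_uncertain_queries(samples, clicked_queries={"running shoes"}, k=2)
```

Each round would then run RL on the selected queries, re-score fresh samples with the updated policy, and select again.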

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.