Quality Over Clicks: Iterative Reinforcement Learning for Early-Stage E-Commerce Query Suggestion
Summary
Cold-EQS is an iterative reinforcement learning framework from Alibaba International Digital Commercial Group that addresses the cold-start problem in e-commerce query suggestion. Conventional approaches pair large language models with Click-Through Rate (CTR) models and struggle when extensive online click data is unavailable. Cold-EQS instead optimizes suggested queries against three intrinsic quality rewards: answerability, factual accuracy, and information gain. Training proceeds in phases: supervised fine-tuning of Qwen3-4B, reinforcement learning with quality-aware rewards evaluated by Qwen-30B-A3B, and uncertainty-aware sampling that selects challenging online queries lacking click signals. The framework achieved a +6.81% improvement in online chatUV and superior offline performance on its EQS-Benchmark dataset, with 86.1% Strict Accuracy and 90.6% Valid Rate. The code, models, and the 16,949-query benchmark are publicly available.
Key takeaway
Machine learning engineers building e-commerce conversational AI should consider adopting intrinsic quality rewards and uncertainty-aware sampling to overcome cold-start challenges in query suggestion. This approach, which gave Cold-EQS a +6.81% improvement in online chatUV, allows model optimization even with sparse click data. The publicly available EQS-Benchmark and Cold-EQS code can accelerate development and help keep query suggestions answerable and factual from the outset.
Key insights
Cold-EQS uses intrinsic quality rewards and uncertainty sampling to enable effective e-commerce query suggestion in cold-start scenarios.
Principles
- Intrinsic quality metrics improve cold-start query suggestion.
- Uncertainty sampling targets ambiguous online data.
- Iterative RL refines models without dense click data.
Method
Cold-EQS fine-tunes a base LLM (Qwen3-4B) with clicked data, then iteratively optimizes it via RL using answerability, factuality, and information gain as rewards, sampling uncertain online queries.
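To make the reward step concrete, here is a minimal sketch of how three quality scores could be folded into one scalar reward for RL. This summary does not give the actual Cold-EQS formula, so the weighted average, the [0, 1] score range, and the names `QualityScores` and `quality_reward` are all assumptions for illustration. In a real pipeline a judge model would produce the scores; here they are passed in directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityScores:
    """Hypothetical per-suggestion judge scores, each assumed to lie in [0, 1]."""
    answerability: float
    factuality: float
    information_gain: float


def quality_reward(scores, weights=(1.0, 1.0, 1.0)):
    """Combine the three quality scores into one scalar reward.

    Assumption: a normalized weighted average. The paper may combine
    the signals differently (for example, with gating or thresholds).
    """
    parts = (scores.answerability, scores.factuality, scores.information_gain)
    for value in parts:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each quality score must lie in [0, 1]")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * v for w, v in zip(weights, parts)) / total_weight


if __name__ == "__main__":
    # An answerable, partly factual suggestion that adds no new information.
    print(quality_reward(QualityScores(1.0, 0.5, 0.0)))
```

A plain average rewards a suggestion that is strong on two axes even if it fails the third. If an unanswerable suggestion should never score well, a multiplicative or gated combination may be a better fit.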
In practice
- Implement quality-driven rewards for LLM fine-tuning.
- Use uncertainty sampling to prioritize hard examples.
- Leverage EQS-Benchmark for offline evaluation.
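The uncertainty sampling item above can be sketched as follows. The summary does not state which uncertainty criterion Cold-EQS uses, so this example assumes one plausible proxy: the variance of quality rewards across several suggestions sampled for the same query. The function name `select_uncertain_queries` and the input layout are hypothetical.

```python
from statistics import pvariance


def select_uncertain_queries(rewards_by_query, top_n):
    """Return the top_n queries whose sampled rewards disagree the most.

    rewards_by_query maps a query string to the rewards of several
    suggestions sampled for it. Assumption: high reward variance marks
    a query the current policy is unsure about.
    """
    scored = [
        (pvariance(rewards), query)
        for query, rewards in rewards_by_query.items()
        if len(rewards) >= 2  # a spread needs at least two samples
    ]
    # Highest variance first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [query for _, query in scored[:top_n]]


if __name__ == "__main__":
    sampled = {
        "wireless earbuds": [0.5, 0.5, 0.5],        # consistent: low priority
        "gift for dad": [0.0, 1.0, 0.0, 1.0],       # highly inconsistent
        "usb-c cable 2m": [0.4, 0.6],               # mildly inconsistent
    }
    print(select_uncertain_queries(sampled, top_n=2))
```

The selected queries would then feed the next RL iteration, so training effort concentrates on inputs where the policy's output quality is least stable.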
Topics
- E-commerce AI
- Query Suggestion
- Reinforcement Learning
- Cold-Start Problem
- LLM Fine-tuning
- Uncertainty Sampling
- EQS-Benchmark
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.