ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents
Summary
ShoppingBench is a new end-to-end shopping benchmark designed to evaluate large language model (LLM) agents on complex, real-world e-commerce intents beyond basic product finding. It features 3,310 user instructions across four progressively challenging intent types: Products Finder, Knowledge, Multi-products seller, and Coupon & Budget. The benchmark utilizes a large-scale interactive shopping sandbox with over 2.5 million real-world products from Lazada.com. Experimental results show that even advanced agents like GPT-4.1 achieve an Absolute Success Rate (ASR) under 50%, dropping to 30.4% for Coupon & Budget tasks. A proposed trajectory distillation strategy, combining supervised fine-tuning (SFT) and reinforcement learning (RL) on synthetic data, enabled a smaller Qwen3-4B agent to achieve competitive performance, surpassing GPT-4.1 by 0.5% ASR.
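To make the headline numbers concrete, here is a minimal sketch of how a success rate could be aggregated overall and per intent type. It assumes each task is scored as a single pass/fail outcome; the summary does not spell out the benchmark's exact scoring rules, so the function name `absolute_success_rate` and the input format are illustrative, not the authors' implementation.

```python
from collections import defaultdict


def absolute_success_rate(results):
    """Aggregate pass/fail outcomes into overall and per-intent success rates.

    `results` is an iterable of (intent_type, success) pairs. Treating each
    task as one boolean outcome is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for intent, success in results:
        totals[intent] += 1
        wins[intent] += int(bool(success))

    per_intent = {intent: wins[intent] / totals[intent] for intent in totals}
    n_tasks = sum(totals.values())
    overall = sum(wins.values()) / n_tasks if n_tasks else 0.0
    return overall, per_intent


# Toy outcomes (made up for illustration, not benchmark data).
toy_results = [
    ("Products Finder", True),
    ("Products Finder", False),
    ("Knowledge", True),
    ("Coupon & Budget", False),
]
overall, per_intent = absolute_success_rate(toy_results)
```

Reporting the per-intent breakdown alongside the overall figure is what exposes gaps such as the weaker Coupon & Budget results.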
Key takeaway
AI scientists and ML engineers developing e-commerce agents should prioritize evaluating models against complex, multi-step user intents such as budget optimization and multi-seller purchases, rather than basic product finding alone. Consider trajectory distillation with supervised fine-tuning and reinforcement learning to train smaller, high-performing agents efficiently. This approach can narrow the performance gap with larger models and address the limitations that benchmarks like ShoppingBench reveal.
Key insights
Real-world e-commerce user intents are complex, posing significant challenges for current LLM-based agents.
Principles
- E-commerce agent benchmarks need complex, grounded user intents.
- Trajectory distillation can transfer the capabilities of large agents to smaller models.
- External tool integration is crucial for long-tail domain knowledge.
Method
ShoppingBench is built by simulating user instructions from real-world products, providing an interactive sandbox of over 2.5 million products, and defining intent-grounded evaluation metrics. Agents are trained with SFT and RL on trajectories that are generated by GPT-4.1 and filtered by rejection sampling.
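The rejection-sampling step can be sketched as a simple filter over candidate trajectories. This is a minimal illustration, not the authors' pipeline: the `Trajectory` fields, the `reward` score, the threshold, and the per-instruction cap are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    instruction: str   # the user instruction the agent was given
    steps: List[str]   # hypothetical: serialized tool calls and observations
    reward: float      # hypothetical: score from a task verifier, in [0, 1]


def rejection_sample(
    candidates: List[Trajectory],
    threshold: float = 1.0,
    max_per_instruction: int = 1,
) -> List[Trajectory]:
    """Keep only trajectories whose reward meets the threshold.

    At most `max_per_instruction` trajectories are kept per instruction,
    preferring higher rewards, so no single task dominates the training set.
    """
    kept: List[Trajectory] = []
    counts: Dict[str, int] = {}
    # Sort by reward (descending); sorted() is stable, so ties keep input order.
    for traj in sorted(candidates, key=lambda t: t.reward, reverse=True):
        if traj.reward < threshold:
            continue  # reject low-quality trajectories
        seen = counts.get(traj.instruction, 0)
        if seen >= max_per_instruction:
            continue  # this instruction already has enough accepted samples
        counts[traj.instruction] = seen + 1
        kept.append(traj)
    return kept


# Toy candidates (made up for illustration).
candidates = [
    Trajectory("find a red kettle under $30", ["search", "view", "recommend"], 1.0),
    Trajectory("find a red kettle under $30", ["search", "recommend"], 0.4),
    Trajectory("buy two items from one seller", ["search", "recommend"], 0.0),
]
sft_data = rejection_sample(candidates)
```

The accepted trajectories would then serve as supervised fine-tuning data, with reinforcement learning applied afterwards as the summary describes.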
In practice
- Design agents for multi-step reasoning and tool use.
- Filter synthetic trajectories for high-quality training data.
- Incorporate web search for knowledge-intensive tasks.
Topics
- ShoppingBench
- LLM Agents
- E-commerce AI
- Agent Benchmarking
- Trajectory Distillation
- Supervised Fine-Tuning
- Reinforcement Learning
Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.