ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, E-commerce & Digital Commerce · Depth: Expert, extended

Summary

ShoppingBench is an end-to-end shopping benchmark for evaluating large language model (LLM) agents on complex, real-world e-commerce intents that go beyond basic product finding. It contains 3,310 user instructions across four progressively harder intent types: Products Finder, Knowledge, Multi-products seller, and Coupon & Budget. Agents act in a large-scale interactive shopping sandbox built from over 2.5 million real-world products from Lazada.com. In the reported experiments, even advanced agents such as GPT-4.1 achieve an Absolute Success Rate (ASR) under 50%, dropping to 30.4% on Coupon & Budget tasks. A proposed trajectory distillation strategy, which combines supervised fine-tuning (SFT) and reinforcement learning (RL) on synthetic data, enabled a smaller Qwen3-4B agent to reach competitive performance, surpassing GPT-4.1 by 0.5% ASR.
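To make the per-intent ASR figures concrete, here is a minimal Python sketch of how such a breakdown could be computed. It assumes ASR is the fraction of instructions an agent fully satisfies; this summary does not give the benchmark's exact definition. The function name `absolute_success_rate` and the `(intent_type, success)` record layout are hypothetical and are not taken from the ShoppingBench code.

```python
from collections import defaultdict


def absolute_success_rate(results):
    """Return {intent_type: ASR} from (intent_type, success) pairs.

    Assumption: ASR is the share of instructions solved completely;
    the record format here is illustrative only.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for intent, success in results:
        totals[intent] += 1
        wins[intent] += int(success)
    return {intent: wins[intent] / totals[intent] for intent in totals}


# Toy, made-up outcomes used only to exercise the function.
toy_results = [
    ("Products Finder", True),
    ("Products Finder", False),
    ("Coupon & Budget", False),
    ("Coupon & Budget", False),
    ("Coupon & Budget", True),
    ("Coupon & Budget", False),
]
per_intent = absolute_success_rate(toy_results)
```

Reporting the rate per intent type, rather than one pooled number, is what surfaces gaps such as the drop on Coupon & Budget tasks.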

Key takeaway

AI scientists and ML engineers developing e-commerce agents should prioritize evaluating models against complex, multi-step user intents such as budget optimization and multi-seller purchases, rather than basic product finding alone. Consider trajectory distillation with supervised fine-tuning and reinforcement learning as an efficient way to train smaller, high-performing agents. This approach can narrow the performance gap with larger models and address the limitations that benchmarks like ShoppingBench reveal.

Key insights

Real-world e-commerce user intents are complex, posing significant challenges for current LLM-based agents.

Method

ShoppingBench is built by simulating user instructions from real-world products, providing an interactive sandbox of over 2.5 million products, and defining intent-grounded evaluation metrics. Agent training uses trajectories that are generated by GPT-4.1 and filtered by rejection sampling, for both SFT and RL.
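The rejection-sampling step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the callables `generate` (standing in for a teacher model such as GPT-4.1 acting in the sandbox) and `is_success` (standing in for the benchmark's success check) are hypothetical, and both the sample budget and the choice to keep only the first successful trajectory per instruction are assumptions.

```python
from typing import Callable, Dict, List, Sequence


def rejection_sample(
    instructions: Sequence[str],
    generate: Callable[[str], List[str]],
    is_success: Callable[[str, List[str]], bool],
    samples_per_instruction: int = 4,
) -> List[Dict[str, object]]:
    """Collect teacher trajectories that pass the success check.

    Failed trajectories are discarded (rejected); instructions with no
    successful sample within the budget contribute nothing.
    """
    kept = []
    for instruction in instructions:
        for _ in range(samples_per_instruction):
            trajectory = generate(instruction)
            if is_success(instruction, trajectory):
                kept.append({"instruction": instruction, "trajectory": trajectory})
                break  # assumption: one accepted trajectory per instruction
    return kept


# Stub teacher and checker, used only to show the control flow.
call_count = {"n": 0}


def stub_generate(instruction: str) -> List[str]:
    call_count["n"] += 1
    return ["search: " + instruction, "recommend"]


def stub_is_success(instruction: str, trajectory: List[str]) -> bool:
    return instruction != "unsolvable request"


dataset = rejection_sample(
    ["find a red kettle", "unsolvable request"], stub_generate, stub_is_success
)
```

The accepted pairs would then serve as training data for SFT; how the same trajectories feed into RL is not detailed in this summary.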

Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.