ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Summary
ShopGym is an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. It addresses the trade-off between realism and experimental control in existing evaluation methodologies. The framework has two main components. ShopArena converts live seed storefronts into self-contained, anonymized sandbox shops through a staged generation process. ShopGuru synthesizes benchmark tasks grounded in these simulated storefronts across seven skill categories. The resulting evaluation artifacts are stable, resettable, and inspectable, and they preserve the structural properties and agent-evaluation signals relevant to shopping tasks. Validation combines graph-based structural analysis with agent-based behavioral evaluation across 224 generated tasks and six sandbox shops (three synthetic, three real-data-based). The results indicate that synthetic shops retain key structural properties and that agent performance on them correlates with performance on live storefronts.
Key takeaway
For research scientists developing and evaluating e-commerce web agents, ShopGym offers a way around the realism-control dilemma. Consider it when you need reproducible, inspectable benchmarks: it supports controlled experimentation while preserving the behavioral signals of live storefronts, which makes comparing and training next-generation agents more reliable.
Key insights
ShopGym offers a scalable framework for creating realistic, controllable, and reproducible e-commerce web agent evaluation environments.
Principles
- Separate storefront exploration from sandbox synthesis.
- Ground synthetic environments in real-world structural data.
- Use multi-agent systems for robust environment generation.
Method
ShopArena explores live storefronts to create anonymized specifications, then synthesizes sandbox shops through a stepwise code-generation loop in which each step is verified by executing it. ShopGuru then generates tasks grounded in those shops, including LLM-authored long-horizon journeys, and validates them with a polish loop.
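The generate-then-verify pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ShopArena's actual implementation: the names `build_sandbox`, `StepResult`, `generate`, and `verify` are hypothetical, and the toy callables stand in for an LLM code generator and an execution harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    step: str        # name of the generation step (hypothetical)
    artifact: str    # generated output accepted for this step
    attempts: int    # number of generate/verify rounds it took


def build_sandbox(
    steps: List[str],
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str], Optional[str]],
    max_attempts: int = 3,
) -> List[StepResult]:
    """Run steps in order, regenerating with error feedback until each verifies.

    `generate(step, feedback)` returns an artifact; `verify(step, artifact)`
    returns None on success or an error message used as feedback.
    """
    results: List[StepResult] = []
    for step in steps:
        feedback: Optional[str] = None
        for attempt in range(1, max_attempts + 1):
            artifact = generate(step, feedback)
            feedback = verify(step, artifact)
            if feedback is None:  # verification passed, move to the next step
                results.append(StepResult(step, artifact, attempt))
                break
        else:
            raise RuntimeError(
                f"step {step!r} failed after {max_attempts} attempts: {feedback}"
            )
    return results


# Toy stand-ins (assumptions): the first draft of every step except
# "catalog" fails verification, and the revised draft passes.
def toy_generate(step: str, feedback: Optional[str]) -> str:
    return f"{step}:fixed" if feedback else f"{step}:draft"


def toy_verify(step: str, artifact: str) -> Optional[str]:
    if step == "catalog" or artifact.endswith(":fixed"):
        return None
    return "execution failed"


results = build_sandbox(["catalog", "cart", "checkout"], toy_generate, toy_verify)
for r in results:
    print(r.step, r.artifact, r.attempts)
```

Feeding the verifier's error message back into the next generation call is what distinguishes this from a plain retry: each attempt can react to the specific failure instead of sampling blindly.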
In practice
- Convert live e-commerce sites into stable sandbox environments.
- Generate diverse, complex shopping tasks for agent evaluation.
- Use LLM-authored tasks with validation loops to ensure feasibility.
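One simple form of feasibility checking is to confirm that a generated task can actually be satisfied by the sandbox shop's catalog. The sketch below assumes a hypothetical data model (`Product`, `Task`, and their fields are illustrative, not ShopGuru's schema) and keeps only tasks for which at least one in-stock product matches the stated constraints.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Product:
    sku: str
    category: str
    price: float
    in_stock: bool


@dataclass(frozen=True)
class Task:
    instruction: str                   # natural-language goal given to the agent
    category: str                      # category the task is grounded in
    max_price: Optional[float] = None  # optional budget constraint


def matching_products(task: Task, catalog: List[Product]) -> List[Product]:
    """Return in-stock products that satisfy the task's constraints."""
    return [
        p
        for p in catalog
        if p.category == task.category
        and p.in_stock
        and (task.max_price is None or p.price <= task.max_price)
    ]


def feasible_tasks(tasks: List[Task], catalog: List[Product]) -> List[Task]:
    """Keep only tasks that at least one catalog product can satisfy."""
    return [t for t in tasks if matching_products(t, catalog)]


# Illustrative data (assumed, not taken from the benchmark).
catalog = [
    Product("SKU-1", "shoes", 59.0, True),
    Product("SKU-2", "shoes", 120.0, True),
    Product("SKU-3", "bags", 35.0, False),
]
tasks = [
    Task("Add running shoes under $80 to the cart", "shoes", 80.0),
    Task("Buy a bag under $50", "bags", 50.0),    # only match is out of stock
    Task("Find shoes under $20", "shoes", 20.0),  # no product is cheap enough
]

kept = feasible_tasks(tasks, catalog)
print([t.instruction for t in kept])
```

A check like this catches only catalog-level infeasibility. Long-horizon journeys would also need their intermediate steps validated against the running shop, which this sketch does not attempt.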
Topics
- E-commerce Web Agents
- Realistic Simulation
- Scalable Benchmarking
- ShopGym Framework
- ShopArena
Best for: Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.