ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Summary
ShopGym is an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. It addresses the trade-off in existing evaluation methods between the realism of live storefronts and the control of sandboxes by providing a scalable way to create evaluation settings that are realistic, diverse, controllable, inspectable, and reproducible. The framework has two main components. ShopArena converts live seed storefronts into self-contained sandbox shops using anonymized specifications and a validated generation process. ShopGuru synthesizes benchmark tasks across seven skill categories, grounding them in each shop's catalog, navigation, policies, and interaction affordances. The resulting evaluation artifacts are stable, resettable, and inspectable, and they preserve the structural properties and agent-evaluation signals relevant to shopping tasks. The framework is validated with graph-based structural analysis and agent-based behavioral evaluation across 224 tasks and six sandbox shops (three synthetic, three real-data). The results indicate that synthetic shops preserve key structural properties of live storefronts and that agent performance on synthetic shops is positively correlated with performance on real-data shops.
Key takeaway
For research scientists developing and evaluating e-commerce web agents, ShopGym addresses the tension between realism and reproducibility. Consider using it to construct controlled, inspectable, and stable evaluation environments, which make comparisons of agent performance more reliable and benchmarking results easier to reproduce.
Key insights
ShopGym offers a scalable framework for realistic, reproducible e-commerce web agent simulation and benchmarking.
Principles
- Realism and reproducibility are key.
- Structural properties must be preserved.
- Synthetic data can mirror live storefronts.
Method
ShopGym first uses ShopArena to convert live storefronts into self-contained sandbox shops. ShopGuru then synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation, policies, and interaction affordances.
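The two-stage flow can be illustrated with a minimal Python sketch. This is not ShopGym's actual API: the names `SandboxShop`, `Task`, `build_sandbox`, and `synthesize_tasks`, the shape of the specification dictionary, and the placeholder category labels are all assumptions made for illustration, since this summary does not name the seven skill categories or describe the real interfaces.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SandboxShop:
    """Hypothetical stand-in for a self-contained, resettable sandbox shop."""
    catalog: dict   # product id -> price
    policies: dict  # policy name -> policy text
    cart: list = field(default_factory=list)

    def reset(self):
        # Return the shop to its initial state so every run starts identically.
        self.cart.clear()


@dataclass
class Task:
    """Hypothetical benchmark task grounded in one catalog item."""
    category: str
    instruction: str
    target_product: str


def build_sandbox(spec):
    """Stage 1 (ShopArena-like): build a sandbox from an anonymized spec."""
    return SandboxShop(catalog=dict(spec["catalog"]), policies=dict(spec["policies"]))


def synthesize_tasks(shop, categories, seed=0):
    """Stage 2 (ShopGuru-like): one task per category, tied to a real catalog item."""
    rng = random.Random(seed)  # fixed seed keeps task generation reproducible
    product_ids = sorted(shop.catalog)
    tasks = []
    for category in categories:
        product_id = rng.choice(product_ids)
        instruction = f"[{category}] complete the task for product {product_id}"
        tasks.append(Task(category, instruction, product_id))
    return tasks


# Placeholder labels only; the real category names are not given in this summary.
CATEGORIES = [f"skill_{i}" for i in range(1, 8)]
```

Calling `synthesize_tasks(build_sandbox(spec), CATEGORIES)` on a small specification yields seven tasks, each pointing at a product that exists in the sandbox catalog. Seeding the generator and exposing `reset` reflect the reproducibility and resettability goals described above.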
In practice
- Simulate e-commerce agents in controlled environments.
- Generate diverse benchmark tasks for web agents.
- Validate agent performance using synthetic shops.
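One way to check that synthetic shops carry the same evaluation signal as real-data shops is to correlate agent success rates across the two. The sketch below assumes a Pearson correlation over per-category success rates. The summary does not say which statistic or grouping the authors used, so both are assumptions, and the helper names `success_rate` and `pearson` are hypothetical.

```python
import math


def success_rate(outcomes):
    """Fraction of successful attempts in a list of booleans."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlate_shops(synthetic_outcomes, real_outcomes):
    """Correlate per-category success rates; both dicts map category -> outcomes."""
    categories = sorted(set(synthetic_outcomes) & set(real_outcomes))
    synthetic_rates = [success_rate(synthetic_outcomes[c]) for c in categories]
    real_rates = [success_rate(real_outcomes[c]) for c in categories]
    return pearson(synthetic_rates, real_rates)
```

A value near 1 would suggest that categories that are hard for an agent on real-data shops are also hard on synthetic ones. With only a handful of categories the estimate is noisy, so it is best read alongside the per-task results.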
Topics
- E-commerce Web Agents
- ShopGym Framework
- Simulation Environments
- Scalable Benchmarking
- ShopArena
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.