ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation) · Depth: Expert, extended

Summary

ShopGym is an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. It addresses the trade-off between realism and experimental control in existing evaluation methodologies. The framework has two main components: ShopArena, which converts live seed storefronts into self-contained, anonymized sandbox shops through a staged generation process, and ShopGuru, which synthesizes benchmark tasks grounded in these simulated storefronts across seven skill categories. ShopGym produces stable, resettable, and inspectable evaluation artifacts that preserve the structural properties and agent-evaluation signals relevant to shopping tasks. Validation combines graph-based structural analysis with agent-based behavioral evaluation across 224 generated tasks and six sandbox shops (three synthetic, three real-data-based). The results indicate that synthetic shops retain key structural properties and that agent performance on them correlates with performance on live storefronts.

Key takeaway

For research scientists developing and evaluating e-commerce web agents, ShopGym addresses the realism-control dilemma. Consider using it to build reproducible, inspectable benchmarks: it allows controlled experimentation while preserving the behavioral signals of live storefronts, which supports more reliable comparison and training of agents.

Key insights

ShopGym offers a scalable framework for creating realistic, controllable, and reproducible e-commerce web agent evaluation environments.

Method

ShopArena explores live storefronts to create anonymized specifications, then synthesizes sandbox shops via a stepwise code generation loop with execution-verification. ShopGuru then generates grounded tasks, including LLM-authored long-horizon journeys, validated by a polish loop.
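The stepwise generate-then-verify pattern described above can be illustrated with a minimal Python sketch. This is not ShopArena's implementation: the summary does not specify its interfaces, so every name here (`build_shop`, `verify_by_execution`, `toy_generator`, the step names) is hypothetical, and the generator is a stub standing in for an LLM call. The sketch only shows the control flow of generating one step at a time, executing the result, and retrying with the error as feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    step: str
    code: str
    attempts: int


def build_shop(
    steps: List[str],
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> List[StepResult]:
    """Generate each step in order, retrying with error feedback until it verifies."""
    results: List[StepResult] = []
    for step in steps:
        feedback: Optional[str] = None
        for attempt in range(1, max_attempts + 1):
            code = generate(step, feedback)
            feedback = verify(code)  # None means the check passed
            if feedback is None:
                results.append(StepResult(step, code, attempt))
                break
        else:
            # No attempt verified: stop rather than build on a broken step.
            raise RuntimeError(
                f"step {step!r} failed after {max_attempts} attempts: {feedback}"
            )
    return results


def verify_by_execution(code: str) -> Optional[str]:
    """Run the generated snippet; return an error message, or None on success."""
    try:
        exec(compile(code, "<generated>", "exec"), {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def toy_generator(step: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: the first draft is broken, the retry is fixed."""
    if feedback is None:
        return f"pages = undefined_name  # draft for {step}"
    return f"pages = ['{step}']"


results = build_shop(["catalog", "cart", "checkout"], toy_generator, verify_by_execution)
for r in results:
    print(r.step, r.attempts)
```

In this toy run every step needs two attempts, because the stub's first draft always raises a `NameError` that the verifier feeds back into the second call. A real verifier would more plausibly launch the generated storefront and probe it, which is outside what this sketch assumes.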

Best for: Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.