Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce
Summary
Researchers at eBay Inc. developed a modular two-agent simulation framework for evaluating conversational shopping assistant architectures in e-commerce, designed to overcome the limitations of A/B testing and single-agent conversation generation. The framework pairs an independent buyer agent, configured with personas and missions, with an interchangeable responder that integrates a real e-commerce search API. Across 2,011 conversations spanning 14 persona buckets, four empirical findings emerged. First, rolling-window memory outperformed intent-extraction memory by 0.01 to 0.10 points on all quality metrics and was 35% faster per query. Second, systematic failure analysis enabled targeted fixes that reduced failure and near-failure rates by 62%. Third, swapping the LLM backbone from Gemini 2.5 to Llama 3.3 70B lowered quality by 0.16 to 0.45 points. Fourth, LLM judge selection is critical: state-of-the-art Gemini and Claude judges disagreed by two or more points on 30% of conversations despite identical prompts.
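To make the judge-disagreement figure concrete, here is a minimal sketch of how such a rate could be computed from two judges' per-conversation scores. The function name, the integer score scale, and the sample scores are assumptions for illustration; the summary does not describe the paper's actual scoring code.

```python
from typing import Sequence


def disagreement_rate(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    threshold: float = 2.0,
) -> float:
    """Fraction of conversations where two judges differ by >= threshold points.

    Hypothetical helper: both sequences hold one score per conversation,
    in the same conversation order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both judges must score the same conversations")
    if not scores_a:
        raise ValueError("at least one conversation is required")
    disagreements = sum(
        1 for a, b in zip(scores_a, scores_b) if abs(a - b) >= threshold
    )
    return disagreements / len(scores_a)


# Illustrative scores only, not data from the paper.
judge_one = [5, 4, 3, 1]
judge_two = [5, 2, 3, 4]
rate = disagreement_rate(judge_one, judge_two)
```

With these made-up scores, two of the four conversations differ by at least two points. Tracking this rate across judge pairs is one way to check how sensitive a conclusion is to the choice of judge.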
Key takeaway
AI engineers evaluating conversational shopping assistants should implement a two-agent simulation framework for rigorous architectural comparison. It enables controlled testing of memory strategies and LLM backbones, reveals performance trade-offs, and supports rapid, evidence-driven iteration. Treat LLM judge selection as a critical design decision, because different judges embed distinct evaluation philosophies.
Key insights
eBay's two-agent simulation framework offers controlled, rapid evaluation of conversational shopping assistants, revealing key architectural and LLM judge impacts.
Principles
- Simpler rolling-window memory can outperform complex intent extraction.
- LLM backbone choice independently impacts generative quality.
- LLM judge selection is a critical architectural decision.
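The first principle refers to rolling-window memory, which keeps only the most recent turns of a conversation as context. A minimal sketch follows, assuming a fixed turn count as the window; the class name and the default window size are illustrative and are not taken from the paper.

```python
from collections import deque


class RollingWindowMemory:
    """Keeps the last `max_turns` conversation turns and drops older ones.

    Hypothetical sketch: the window size used in the paper is not stated
    in this summary, so the default here is an arbitrary assumption.
    """

    def __init__(self, max_turns: int = 6) -> None:
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        # A deque with maxlen discards the oldest entry automatically.
        self._turns: deque = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self._turns.append((role, text))

    def context(self) -> str:
        """Render the retained turns as a prompt-ready transcript."""
        return "\n".join(f"{role}: {text}" for role, text in self._turns)


memory = RollingWindowMemory(max_turns=2)
memory.add("buyer", "I need running shoes")
memory.add("responder", "Here are some options")
memory.add("buyer", "Anything under $50?")
# Only the two most recent turns remain in memory.context().
```

The appeal of this design is that it needs no extra model call per turn, in contrast to an intent-extraction step; that would be consistent with the lower per-query latency reported above, although the summary does not attribute the speedup to a specific cause.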
Method
A buyer agent with personas and missions interacts with an interchangeable responder integrating a real e-commerce search API, enabling controlled comparison of responder designs.
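The interaction described above can be sketched as a simple turn loop. This is an illustration under stated assumptions, not the authors' implementation: `ScriptedBuyer` stands in for an LLM-driven buyer agent, `SearchResponder` wraps a caller-supplied `search` function in place of the real e-commerce search API, and every name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Protocol


@dataclass
class Turn:
    role: str  # "buyer" or "responder"
    text: str


class Responder(Protocol):
    """Interface every interchangeable responder design must satisfy."""

    def reply(self, history: List[Turn]) -> str: ...


@dataclass
class ScriptedBuyer:
    """Hypothetical stand-in for an LLM buyer with a persona and mission."""

    persona: str
    mission: str
    script: List[str]
    _cursor: int = field(default=0, init=False)

    def next_message(self, history: List[Turn]) -> Optional[str]:
        # A real buyer agent would condition on persona, mission, and history.
        if self._cursor >= len(self.script):
            return None  # mission finished, end the conversation
        message = self.script[self._cursor]
        self._cursor += 1
        return message


@dataclass
class SearchResponder:
    """Responder that answers by querying an injected search function."""

    search: Callable[[str], List[str]]  # assumed shape of the search call
    top_k: int = 3

    def reply(self, history: List[Turn]) -> str:
        query = history[-1].text  # naive: use the latest buyer message
        items = self.search(query)[: self.top_k]
        if not items:
            return "No results found."
        return "Top results: " + "; ".join(items)


def simulate(buyer: ScriptedBuyer, responder: Responder,
             max_turns: int = 10) -> List[Turn]:
    """Alternate buyer and responder turns until the buyer stops."""
    history: List[Turn] = []
    for _ in range(max_turns):
        message = buyer.next_message(history)
        if message is None:
            break
        history.append(Turn("buyer", message))
        history.append(Turn("responder", responder.reply(history)))
    return history
```

Because `simulate` depends only on the `Responder` interface, two responder designs can be run against the same personas and missions and their transcripts scored side by side, which is the kind of controlled comparison the method describes.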
In practice
- Use two-agent simulation for pre-production architecture testing.
- Prioritize simpler memory designs for conversational agents.
- Conduct systematic failure analysis for rapid iteration.
Topics
- Agentic Search
- E-commerce AI
- LLM Evaluation
- Two-Agent Simulation
- Conversational Memory
- LLM Judges
Best for: Research Scientist, NLP Engineer, AI Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.