Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, extended

Summary

eBay Inc. researchers developed a modular two-agent simulation framework for evaluating conversational shopping assistant architectures in e-commerce, designed to address the limitations of A/B testing and single-agent generation. The framework pairs an independent buyer agent, configured with personas and missions, with an interchangeable responder that integrates a real e-commerce search API. Across 2,011 conversations spanning 14 persona buckets, the study reports four empirical findings. First, rolling-window memory outperformed intent-extraction memory by 0.01 to 0.10 points on all quality metrics and was 35% faster per query. Second, systematic failure analysis enabled targeted fixes that reduced failure and near-failure rates by 62%. Third, swapping the LLM backbone from Gemini 2.5 to Llama 3.3 70B lowered quality by 0.16 to 0.45 points. Fourth, LLM judge selection is critical: state-of-the-art Gemini and Claude models disagreed by two or more points on 30% of conversations despite identical prompts.

Key takeaway

AI engineers evaluating conversational shopping assistants should implement a two-agent simulation framework for rigorous architectural comparison. It enables controlled testing of memory strategies and LLM backbones, reveals performance trade-offs, and supports rapid, evidence-driven iteration. Treat LLM judge selection as a critical design decision, because different judges embed distinct evaluation philosophies.

Key insights

eBay's two-agent simulation framework offers controlled, rapid evaluation of conversational shopping assistants and shows how memory design, LLM backbone, and LLM judge choice each affect measured quality.

Principles

Simpler rolling-window memory can outperform complex intent extraction.
LLM backbone choice independently impacts generative quality.
LLM judge selection is a critical architectural decision.
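
The judge principle can be made concrete with a small calculation. The sketch below computes the share of conversations on which two judges differ by at least a given number of points, which is the kind of statistic behind the 30% disagreement figure. The function name, the score lists, and the scoring scale are illustrative assumptions, not values or code from the paper.

```python
def disagreement_rate(scores_a, scores_b, threshold=2):
    """Fraction of conversations where two judges differ by >= threshold points.

    scores_a and scores_b are per-conversation scores from two judges,
    aligned by index. Hypothetical helper, not taken from the paper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("judges must score the same conversations")
    if not scores_a:
        return 0.0
    large_gaps = sum(
        1 for a, b in zip(scores_a, scores_b) if abs(a - b) >= threshold
    )
    return large_gaps / len(scores_a)


# Made-up scores for five conversations, purely for illustration.
judge_one = [5, 4, 3, 2, 5]
judge_two = [5, 2, 3, 4, 4]
rate = disagreement_rate(judge_one, judge_two)
```

Running the same check on a sample of your own conversations before committing to a single judge is one way to see how much the choice matters for your data.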

Method

A buyer agent with personas and missions interacts with an interchangeable responder integrating a real e-commerce search API, enabling controlled comparison of responder designs.
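
The interaction described above can be sketched as a simple turn loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the buyer here replays scripted messages instead of calling an LLM, the responder is a placeholder rather than a real search API client, and every class and function name (BuyerAgent, Responder, EchoSearchResponder, run_conversation) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol, Tuple

Turn = Tuple[str, str]  # (buyer message, responder reply)


@dataclass
class BuyerAgent:
    """Scripted stand-in for an LLM-driven buyer (hypothetical)."""
    persona: str
    mission: str
    follow_ups: List[str] = field(default_factory=list)

    def next_message(self, turn: int) -> Optional[str]:
        # Turn 0 states the mission; later turns replay scripted follow-ups.
        if turn == 0:
            return f"[{self.persona}] {self.mission}"
        idx = turn - 1
        return self.follow_ups[idx] if idx < len(self.follow_ups) else None


class Responder(Protocol):
    """Any responder design only needs to satisfy this interface."""
    def respond(self, history: List[Turn], message: str) -> str: ...


class EchoSearchResponder:
    """Placeholder responder; a real one would query a search API."""
    def respond(self, history: List[Turn], message: str) -> str:
        return f"results for: {message} (turns seen: {len(history)})"


def run_conversation(buyer: BuyerAgent, responder: Responder,
                     max_turns: int = 5) -> List[Turn]:
    """Alternate buyer and responder until the buyer stops or max_turns is hit."""
    history: List[Turn] = []
    for turn in range(max_turns):
        message = buyer.next_message(turn)
        if message is None:
            break
        reply = responder.respond(history, message)
        history.append((message, reply))
    return history


buyer = BuyerAgent("bargain hunter", "find a used camera under $200",
                   follow_ups=["only mirrorless"])
log = run_conversation(buyer, EchoSearchResponder())
```

Because the loop depends only on the Responder interface, a different memory strategy or LLM backbone can be swapped in while the buyer, personas, and missions stay fixed, which is what makes the comparison controlled.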

In practice

Use two-agent simulation for pre-production architecture testing.
Try simpler memory designs, such as a rolling window, before more complex ones.
Conduct systematic failure analysis for rapid iteration.
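
The memory recommendation above is easy to prototype. The sketch below shows one plausible reading of rolling-window memory: keep only the most recent N turns and pass them to the responder as context. The class name, the default window size, and the context format are assumptions for illustration; the paper's actual configuration is not given in this summary.

```python
from collections import deque


class RollingWindowMemory:
    """Keeps only the most recent `window` turns (hypothetical sketch)."""

    def __init__(self, window: int = 3):
        # A deque with maxlen silently drops the oldest turn when full.
        self._turns = deque(maxlen=window)

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self._turns.append((user_msg, assistant_msg))

    def context(self) -> str:
        # Render the retained turns as plain text for a prompt.
        return "\n".join(
            f"user: {u}\nassistant: {a}" for u, a in self._turns
        )


mem = RollingWindowMemory(window=2)
mem.add("running shoes", "here are 10 results")
mem.add("size 10 only", "filtered to size 10")
mem.add("under $80", "filtered by price")
```

After the third call the first turn has been evicted, so `mem.context()` contains only the size and price turns. An intent-extraction design would instead distill the conversation into a structured intent; that variant is not shown here.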

Topics

Agentic Search
E-commerce AI
LLM Evaluation
Two-Agent Simulation
Conversational Memory
LLM Judges

Best for: Research Scientist, NLP Engineer, AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.