Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Summary
This paper presents a practical blueprint for evaluating and optimizing conversational shopping assistants (CSAs), focusing on multi-agent systems such as MAGIC, a production-scale AI grocery assistant. It defines an evaluation rubric that decomposes end-to-end shopping quality into structured dimensions: Shopping Execution, Personalization, Conversation Quality, and Safety. The authors build a calibrated LLM-as-judge pipeline that reaches 91.4% agreement with human annotations after GEPA prompt optimization. Building on this pipeline, the paper investigates two prompt-optimization strategies: Sub-agent GEPA, which optimizes individual agent nodes against localized rubrics, and MAMuT (Multi-Agent Multi-Turn) GEPA, a novel system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring. On a held-out set of 238 trajectories, MAMuT GEPA raised the overall rubric pass rate from 77.1% to 84.7% and outperformed Sub-agent GEPA, especially in Safety & Compliance (+12.0%) and Conversational Quality (+8.0%).
Key takeaway
If you are an AI Architect or NLP Engineer building a multi-agent conversational system, evaluate and optimize the system as a whole, not just its individual agents. Adopt a trajectory-level evaluation rubric and apply system-level prompt optimization such as MAMuT GEPA. Coordination failures between agents show up mainly in safety, compliance, and conversational flow, and these are the areas that local, per-agent optimization often misses.
Key insights
For multi-agent conversational AI, system-level prompt optimization outperformed tuning each agent individually in the reported results.
Principles
- Decompose complex system quality into structured, verifiable dimensions.
- Calibrate LLM-as-judge pipelines against human annotations for reliability.
- Local optimization often fails to resolve systemic coordination issues.
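The calibration principle above comes down to comparing judge verdicts with human labels on the same items. A minimal sketch of that comparison follows, assuming binary pass/fail verdicts; the function name `agreement_rate` and the toy labels are illustrative assumptions, not the paper's pipeline or data.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of items where the judge verdict matches the human label."""
    if len(judge) != len(human):
        raise ValueError("label lists must have the same length")
    if not judge:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Toy labels invented for illustration (not data from the paper).
human_labels = [True, True, False, True, False, True, True, False]
judge_labels = [True, True, False, False, False, True, True, True]

rate = agreement_rate(judge_labels, human_labels)
print(f"agreement: {rate:.1%}")  # 6 of 8 verdicts match
```

Raw agreement is the simplest calibration signal. When one label dominates, a chance-corrected statistic such as Cohen's kappa may be a more informative complement.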
Method
The method combines three components: a multi-faceted rubric, a calibrated LLM-as-judge, and two prompt-optimization strategies. Sub-agent GEPA optimizes individual agent nodes, while MAMuT GEPA jointly optimizes prompts at the system level using simulated multi-turn user interactions.
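One way to picture how rubric verdicts roll up into pass rates is sketched below. It assumes each judge verdict is a pass/fail on one rubric dimension for one trajectory. The `Verdict` record, the `pass_rates` helper, and the sample data are hypothetical and chosen for illustration; only the dimension names come from the summary.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Verdict:
    """One judge decision (hypothetical record layout)."""
    trajectory_id: str
    dimension: str
    passed: bool


def pass_rates(verdicts: Iterable[Verdict]) -> Tuple[Dict[str, float], float]:
    """Return per-dimension pass rates and the overall pass rate."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for v in verdicts:
        totals[v.dimension] += 1
        passes[v.dimension] += int(v.passed)
    if not totals:
        raise ValueError("no verdicts supplied")
    per_dimension = {d: passes[d] / totals[d] for d in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return per_dimension, overall


# Toy verdicts invented for illustration (not results from the paper).
verdicts = [
    Verdict("t1", "Shopping Execution", True),
    Verdict("t1", "Safety", False),
    Verdict("t2", "Shopping Execution", True),
    Verdict("t2", "Safety", True),
]
per_dimension, overall = pass_rates(verdicts)
print(per_dimension, overall)
```

Keeping the per-dimension breakdown alongside the overall figure makes it easier to see where an optimization strategy helps, which is how the summary reports its Safety and conversational-quality gains.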
In practice
- Use a structured rubric for multi-turn conversational AI evaluation.
- Implement LLM-as-judge for scalable, consistent evaluation.
- Prioritize system-level prompt optimization for multi-agent coordination.
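The gap between per-agent tuning and system-level optimization can be illustrated with a toy selection problem. The sketch below is not GEPA and not the paper's method: it is an exhaustive search over a small, invented score table, where the agent names, prompt names, and all scores are assumptions made up to show how individually best prompts can combine poorly.

```python
from itertools import product

planner_prompts = ["planner_v1", "planner_v2"]
responder_prompts = ["responder_v1", "responder_v2"]

# Hypothetical scores from localized, per-agent rubrics.
local_score = {
    "planner_v1": 0.80,
    "planner_v2": 0.85,
    "responder_v1": 0.78,
    "responder_v2": 0.82,
}

# Hypothetical trajectory-level scores for each prompt combination.
system_score = {
    ("planner_v1", "responder_v1"): 0.77,
    ("planner_v1", "responder_v2"): 0.84,
    ("planner_v2", "responder_v1"): 0.79,
    ("planner_v2", "responder_v2"): 0.74,  # best local prompts interact badly
}


def pick_locally():
    """Choose each agent's prompt by its own local score only."""
    return (
        max(planner_prompts, key=local_score.__getitem__),
        max(responder_prompts, key=local_score.__getitem__),
    )


def pick_jointly():
    """Choose the combination with the best trajectory-level score."""
    return max(product(planner_prompts, responder_prompts),
               key=system_score.__getitem__)


local_choice = pick_locally()
joint_choice = pick_jointly()
print("local:", local_choice, system_score[local_choice])
print("joint:", joint_choice, system_score[joint_choice])
```

In this invented table, the locally chosen pair scores lower end to end than the jointly chosen pair. A real system cannot enumerate prompt combinations this way, which is why a search procedure driven by trajectory-level scoring is needed instead.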
Topics
- Multi-Agent Systems
- Conversational AI
- LLM-as-Judge
- Prompt Optimization
- MAMuT GEPA
Best for: AI Architect, NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.