How DoorDash Built a Testing System to Evaluate LLMs
Summary
DoorDash developed a "simulation and evaluation flywheel" to address subtle hallucination issues in its customer support LLM chatbot, which handles hundreds of thousands of daily contacts. This system comprises an offline simulator that generates realistic multi-turn customer conversations using an LLM to role-play customers based on historical data and mock backend information. The second component is an automated evaluation framework, employing an LLM as a judge to grade chatbot performance against specific policy-based criteria, calibrated with human expert judgment. This flywheel enables rapid iteration, reducing testing cycles from days to hours, running over 200 simulated conversations in under five minutes. A key architectural improvement, the "case state" layer, synthesized raw tool history into structured context, leading to a 90% reduction in hallucinations in simulation and improved production performance.
Key takeaway
For AI Engineers building or maintaining LLM-powered customer support systems, traditional deterministic testing methods are insufficient due to LLM non-determinism. You should implement an automated "simulation and evaluation flywheel", similar to DoorDash's, to rapidly test changes offline. This approach allows you to iterate on prompts and architectures, validate performance across hundreds of scenarios, and catch regressions before impacting real customers, significantly reducing hallucination rates and deployment risks.
Key insights
LLM systems require a rapid, automated "simulation and evaluation flywheel" to manage non-determinism and ensure quality before deployment.
Principles
- LLM non-determinism demands dedicated testing paradigms.
- "LLM-as-judge" excels at narrow, binary policy checks.
- Synthesizing context can reduce LLM hallucinations.
Method
Implement a simulation and evaluation flywheel: define failure modes, create "LLM-as-judge" evaluations, generate multi-turn scenarios from historical data, simulate, and iterate on model context or prompts until pass rates meet criteria.
In practice
- Employ LLMs to dynamically simulate customer interactions.
- Calibrate "LLM-as-judge" with human expert labels.
- Distill raw context into structured "case state" for LLMs.
Topics
- LLM Testing
- Chatbot Evaluation
- AI Hallucinations
- Simulation Frameworks
- MLOps
- Context Management
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.