How DoorDash Built a Testing System to Evaluate LLMs

2025-12-15 · Source: ByteByteGo Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

DoorDash developed a "simulation and evaluation flywheel" to address subtle hallucinations in its customer support LLM chatbot, which handles hundreds of thousands of contacts daily. The system has two components. The first is an offline simulator that generates realistic multi-turn customer conversations: an LLM role-plays the customer, drawing on historical data and mock backend information. The second is an automated evaluation framework that uses an LLM as a judge to grade chatbot performance against specific policy-based criteria, calibrated against human expert judgment. Together they cut testing cycles from days to hours; the flywheel runs more than 200 simulated conversations in under five minutes. A key architectural improvement, the "case state" layer, synthesizes raw tool history into structured context. It led to a 90% reduction in hallucinations in simulation and improved production performance.

Key takeaway

For AI engineers building or maintaining LLM-powered customer support systems, traditional deterministic testing is insufficient because LLM output is non-deterministic. Implement an automated "simulation and evaluation flywheel", similar to DoorDash's, to test changes offline quickly. This approach lets you iterate on prompts and architectures, validate performance across hundreds of scenarios, and catch regressions before they reach real customers, reducing hallucination rates and deployment risk.

Key insights

LLM systems require a rapid, automated "simulation and evaluation flywheel" to manage non-determinism and ensure quality before deployment.

Principles

LLM non-determinism demands dedicated testing paradigms.
"LLM-as-judge" excels at narrow, binary policy checks.
Synthesizing context can reduce LLM hallucinations.
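To make the second principle concrete, here is a minimal sketch of a narrow, binary judge check. It is not DoorDash's implementation: PolicyCheck, build_judge_prompt, parse_verdict, grade, and stub_judge are hypothetical names, and stub_judge is a keyword rule standing in for a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PolicyCheck:
    """One narrow criterion (hypothetical structure, not from the article)."""
    name: str
    question: str  # must be answerable with PASS or FAIL


def build_judge_prompt(check: PolicyCheck, transcript: str) -> str:
    """Ask a single question and demand a one-word verdict."""
    return (
        "You are grading a customer support chatbot.\n"
        f"Criterion: {check.question}\n"
        f"Transcript:\n{transcript}\n"
        "Answer with exactly one word: PASS or FAIL."
    )


def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to a boolean; reject anything else loudly."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unparseable verdict: {raw!r}")
    return verdict == "PASS"


def grade(check: PolicyCheck, transcript: str, judge: Callable[[str], str]) -> bool:
    """Run one binary check; `judge` is any function from prompt to reply."""
    return parse_verdict(judge(build_judge_prompt(check, transcript)))


def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): fails claims of an issued refund."""
    transcript = prompt.split("Transcript:\n", 1)[1]
    return "FAIL" if "refund has been issued" in transcript.lower() else "PASS"
```

Restricting the judge to a one-word verdict keeps parsing trivial and makes its output directly comparable with human labels.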

Method

Implement a simulation and evaluation flywheel: define failure modes, create "LLM-as-judge" evaluations, generate multi-turn scenarios from historical data, simulate, and iterate on model context or prompts until pass rates meet criteria.
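The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the article's code: the type aliases, pass_rate, run_flywheel, the "tool_result" field, and the two stub bots are all hypothetical, and the stubs replace real simulated conversations and a real LLM judge.

```python
from typing import Callable, Dict, List, Optional, Tuple

Scenario = Dict[str, str]        # one simulated case (hypothetical shape)
Bot = Callable[[Scenario], str]  # runs a conversation, returns its transcript
Judge = Callable[[str], bool]    # True when the transcript passes the check


def pass_rate(bot: Bot, scenarios: List[Scenario], judge: Judge) -> float:
    """Simulate every scenario with `bot` and return the fraction that pass."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    passed = sum(judge(bot(scenario)) for scenario in scenarios)
    return passed / len(scenarios)


def run_flywheel(
    candidates: Dict[str, Bot],
    scenarios: List[Scenario],
    judge: Judge,
    threshold: float,
) -> Tuple[Optional[str], Dict[str, float]]:
    """Try candidate prompt/context variants in order; stop at the first
    whose pass rate meets the threshold. Returns (winner, rates tried)."""
    history: Dict[str, float] = {}
    for name, bot in candidates.items():
        history[name] = pass_rate(bot, scenarios, judge)
        if history[name] >= threshold:
            return name, history
    return None, history


# Stub data (assumption): two of four scenarios lack a tool result.
scenarios = [
    {"id": "s1", "tool_result": "delivered"},
    {"id": "s2", "tool_result": ""},
    {"id": "s3", "tool_result": "refunded"},
    {"id": "s4", "tool_result": ""},
]


def baseline_bot(scenario: Scenario) -> str:
    # Makes an unverified claim whenever the tool returned nothing.
    return "ok" if scenario["tool_result"] else "UNVERIFIED claim"


def improved_bot(scenario: Scenario) -> str:
    # Never makes an unverified claim.
    return "ok"


def stub_judge(transcript: str) -> bool:
    return "UNVERIFIED" not in transcript


winner, history = run_flywheel(
    {"baseline": baseline_bot, "improved": improved_bot},
    scenarios,
    stub_judge,
    threshold=0.9,
)
```

With these stubs the baseline passes half the scenarios and the second candidate passes all of them, so the loop stops there. In a real setup each candidate would be a prompt or context change, and the scenarios would come from historical data.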

In practice

Employ LLMs to dynamically simulate customer interactions.
Calibrate "LLM-as-judge" with human expert labels.
Distill raw context into structured "case state" for LLMs.
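The last two practices can be illustrated with a small sketch. The source does not describe what the case state contains or how calibration is measured, so everything here is an assumption: the tool-call record shape ("tool", "ok", "result"), the rule that the latest successful result per tool wins, and simple agreement rate as the calibration metric.

```python
from typing import Any, Dict, List


def build_case_state(tool_history: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Collapse raw tool calls into one record per tool.

    Assumes calls are in chronological order; a later successful call
    overwrites an earlier one, and failed calls are dropped.
    """
    state: Dict[str, Any] = {}
    for call in tool_history:
        if call.get("ok"):
            state[call["tool"]] = call["result"]
    return state


def render_case_state(state: Dict[str, Any]) -> str:
    """Render the state as short 'tool: result' lines for the model context."""
    return "\n".join(f"{tool}: {result}" for tool, result in sorted(state.items()))


def agreement_rate(judge_labels: List[bool], human_labels: List[bool]) -> float:
    """Fraction of cases where the LLM judge matches the human expert label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A low agreement rate on a labelled sample would suggest revising the judge prompt before trusting its pass rates.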

Topics

LLM Testing
Chatbot Evaluation
AI Hallucinations
Simulation Frameworks
MLOps
Context Management

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.