Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
Summary
DoorDash has deployed a multi-agent reinforcement learning system that adapts dispatch objective weights in its large-scale three-sided food-delivery marketplace. The system learns from delayed operational feedback, such as delivery speed and courier utilization. Rather than replacing the existing combinatorial assignment optimizer, it uses a store-level policy to select a discrete multiplier that shifts the optimizer's balance between delivery quality and batching efficiency. This design allows policies to be learned offline from noisy, delayed, and coupled feedback while keeping production feasibility and operational safeguards intact. A shared value function is trained on centralized offline data and executed in decentralized fashion at the store level, using Double Q-learning targets and a conservative regularizer to mitigate out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increased batching and reduced courier-side time costs without compromising customer-facing delivery quality.
Key takeaway
For MLOps engineers deploying AI in complex logistics or marketplace systems, the lesson is to adapt an existing optimizer with a learned policy multiplier rather than replace it outright. This preserves operational constraints while letting RL tune dynamic tradeoffs. Train the policy offline and conservatively, using techniques such as Double Q-learning, and validate it with production switchback experiments so that cost reductions do not come at the expense of service quality.
Key insights
DoorDash uses RL to adapt dispatch objective weights in a three-sided marketplace, balancing delivery quality and batching efficiency.
Principles
- Adapt dispatch objective weights using delayed feedback.
- Preserve existing optimizers with policy multipliers.
- Use conservative regularization for offline RL.
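The second principle, keeping the optimizer and only scaling one of its weights, can be illustrated with a toy sketch. Everything here is assumed for illustration: the `Plan` fields, the two-term score, the `MULTIPLIERS` values, and the `optimize` stand-in are hypothetical and are not taken from DoorDash's system, whose optimizer is combinatorial and far richer.

```python
from dataclasses import dataclass

# Hypothetical discrete action set; the source does not list the actual values.
MULTIPLIERS = (0.5, 1.0, 1.5, 2.0)


@dataclass(frozen=True)
class Plan:
    """A candidate dispatch plan (illustrative fields only)."""
    name: str
    quality_cost: float   # e.g. expected lateness; lower is better
    batching_gain: float  # e.g. courier time saved by batching; higher is better


def plan_score(plan: Plan, multiplier: float, base_weight: float = 1.0) -> float:
    """Two-term objective; the multiplier scales only the batching weight."""
    return plan.quality_cost - multiplier * base_weight * plan.batching_gain


def optimize(plans, multiplier):
    """Stand-in for the existing optimizer: return the lowest-scoring plan."""
    return min(plans, key=lambda p: plan_score(p, multiplier))


plans = [
    Plan("single_order", quality_cost=2.0, batching_gain=0.0),
    Plan("batch_of_two", quality_cost=3.5, batching_gain=2.0),
]

for m in MULTIPLIERS:
    print(m, optimize(plans, m).name)
```

In this toy setup the smallest multiplier favors the unbatched plan and larger multipliers favor batching, while the optimizer itself, and any constraints it enforces, stays unchanged.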
Method
A store-level policy selects a discrete multiplier for the dispatch optimizer. Its shared value function is trained offline on centralized data and executed in decentralized fashion at each store, using Double Q-learning targets and a conservative regularizer.
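The two training ingredients named above can be sketched in NumPy. This is a minimal sketch under assumptions: the summary does not specify the form of the conservative regularizer, so a CQL-style log-sum-exp penalty is used here as one common choice, and the function names, `gamma`, and `alpha` are hypothetical.

```python
import numpy as np


def double_q_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double Q-learning: the online network picks the next action,
    the target network evaluates it."""
    best_actions = np.argmax(q_online_next, axis=1)
    next_values = q_target_next[np.arange(len(rewards)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_values


def conservative_penalty(q_values, actions):
    """Assumed CQL-style term: log-sum-exp of Q over all actions minus
    Q of the logged action, averaged over the batch."""
    row_max = q_values.max(axis=1, keepdims=True)  # for numerical stability
    lse = row_max[:, 0] + np.log(np.exp(q_values - row_max).sum(axis=1))
    q_logged = q_values[np.arange(len(actions)), actions]
    return float(np.mean(lse - q_logged))


def offline_loss(q_values, actions, targets, alpha=1.0):
    """Squared TD error on logged actions plus the weighted penalty."""
    q_logged = q_values[np.arange(len(actions)), actions]
    td_error = np.mean((q_logged - targets) ** 2)
    return float(td_error + alpha * conservative_penalty(q_values, actions))
```

The penalty is non-negative and grows when actions absent from the logged data receive high values, which is the overestimation the regularizer is meant to discourage.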
In practice
- Implement RL for marketplace objective adaptation.
- Integrate RL with existing combinatorial optimizers.
- Use switchback experiments for production validation.
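A switchback experiment alternates treatment and control over time windows within a region instead of splitting individual orders. The sketch below shows one possible way to assign and compare units; the hashing scheme, the one-hour window, the `salt` value, and the plain difference in means are assumptions for illustration, not the design used in the source.

```python
import hashlib
from statistics import mean


def window_index(epoch_seconds: int, window_seconds: int = 3600) -> int:
    """Map a timestamp to its switchback window (window length is assumed)."""
    return epoch_seconds // window_seconds


def switchback_arm(region: str, window: int, salt: str = "exp-1") -> str:
    """Deterministically assign a (region, window) unit to an arm by hashing."""
    digest = hashlib.sha256(f"{salt}:{region}:{window}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


def effect_estimate(unit_metrics):
    """Difference in mean metric between arms.
    `unit_metrics` holds one (arm, value) pair per (region, window) unit."""
    treated = [value for arm, value in unit_metrics if arm == "treatment"]
    control = [value for arm, value in unit_metrics if arm == "control"]
    return mean(treated) - mean(control)
```

Aggregating to one value per (region, window) unit before comparing arms keeps the unit of analysis aligned with the unit of randomization; a real analysis would also need variance estimates that account for correlation across adjacent windows.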
Topics
- Multi-Agent Reinforcement Learning
- Three-Sided Marketplaces
- Dispatch Optimization
- Objective Weight Adaptation
- Offline Reinforcement Learning
- DoorDash
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.