Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
Summary
This paper presents OWA-RL, a deployed reinforcement learning (RL) system at DoorDash that adapts dispatch objective weights in a three-sided food-delivery marketplace. A store-level policy selects a discrete multiplier from {0.8, 0.9, 1.0, 1.1, 1.2} that shifts the existing combinatorial assignment optimizer's tradeoff between delivery quality (ASAP) and batching efficiency (XCAT). OWA-RL is trained offline on delayed regional rewards using Double Q-learning with a conservative regularizer, and it executes in a decentralized fashion at the store level. In a two-week production switchback experiment across approximately 4,000 geographic regions, the offline-trained policy increased batching by 0.495 percentage points and reduced courier-side time costs (CAT and CWT) without degrading customer-facing delivery quality (ASAP and 20-minute lateness). The system serves hundreds of millions of inferences per day at a 20-second cadence.
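The multiplier selection described above can be illustrated with a minimal sketch. It assumes a greedy policy over per-store Q-values and a single hypothetical base weight that the multiplier scales. The function names, the Q-values, and the base weight of 2.0 are illustrative assumptions, not details from the paper.

```python
from typing import Sequence

# The five discrete multipliers named in the summary.
MULTIPLIERS = (0.8, 0.9, 1.0, 1.1, 1.2)


def select_multiplier(q_values: Sequence[float]) -> float:
    """Greedy choice: return the multiplier with the highest Q-value.

    Assumption: the store-level policy exposes one Q-value per multiplier.
    """
    if len(q_values) != len(MULTIPLIERS):
        raise ValueError("expected one Q-value per multiplier")
    best_index = max(range(len(MULTIPLIERS)), key=lambda i: q_values[i])
    return MULTIPLIERS[best_index]


def adjusted_weight(base_weight: float, multiplier: float) -> float:
    """Scale a hypothetical base objective weight.

    The downstream optimizer is untouched; only its weight input changes.
    """
    return base_weight * multiplier


# Hypothetical Q-values for one store at one decision step.
q = [0.10, 0.25, 0.20, 0.40, 0.05]
chosen = select_multiplier(q)
new_weight = adjusted_weight(2.0, chosen)
```

Because the action space is bounded to a narrow band around 1.0, the policy can only nudge the optimizer's tradeoff. That bound is what makes this a constrained control layer and not a replacement for the optimizer.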
Key takeaway
MLOps engineers deploying RL in complex logistics systems should consider integrating it as a constrained control layer over an existing optimizer. DoorDash's OWA-RL system shows that this approach allows adaptive objective-weight tuning while preserving operational safeguards. Offline training with conservative regularization, validated through robust online experimentation, can deliver efficiency gains such as increased batching and reduced courier costs without compromising service quality.
Key insights
A deployed RL system adapts dispatch objective weights in a three-sided marketplace, improving efficiency without sacrificing delivery quality.
Principles
- Modulating existing optimizers with RL preserves operational safeguards.
- Offline RL with conservative regularization enhances training stability.
- Regional reward aggregation captures marketplace network effects.
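One way to picture regional reward aggregation is sketched below. It assumes that store-level outcomes are averaged per region and that every store in the region receives the same shared reward. The summary does not state which aggregation the paper uses, so the mean, the tuple layout, and all identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (region_id, store_id, store-level outcome); the layout is an assumption.
Outcome = Tuple[str, str, float]


def regional_rewards(outcomes: List[Outcome]) -> Dict[str, float]:
    """Average store-level outcomes within each region (assumed aggregation)."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for region, _store, value in outcomes:
        totals[region] += value
        counts[region] += 1
    return {region: totals[region] / counts[region] for region in totals}


def attach_rewards(outcomes: List[Outcome]) -> List[Outcome]:
    """Give every store the shared reward of its region."""
    by_region = regional_rewards(outcomes)
    return [(region, store, by_region[region]) for region, store, _ in outcomes]


sample = [("r1", "store_a", 1.0), ("r1", "store_b", 3.0), ("r2", "store_c", 5.0)]
shared = attach_rewards(sample)
```

A shared regional signal means one store's action is credited with its effect on neighbouring stores that draw on the same courier pool. This is one plausible reading of how aggregation captures network effects.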
Method
The OWA-RL system uses a store-level policy to select a discrete multiplier for an existing combinatorial optimizer's objective. The policy is trained offline with Double DQN and Conservative Q-Learning (CQL) using delayed regional rewards.
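The training objective can be sketched for a single logged transition, using the standard textbook forms of the Double DQN target and the discrete-action CQL penalty. This is an illustration under assumptions, not the paper's implementation: the discount factor, the penalty weight `alpha`, and the way the two terms are combined are all hypothetical here.

```python
import math
from typing import Sequence


def double_dqn_target(reward: float, gamma: float,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float],
                      done: bool = False) -> float:
    """Double DQN: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=lambda i: q_online_next[i])
    return reward + gamma * q_target_next[best]


def cql_penalty(q_values: Sequence[float], action: int) -> float:
    """Discrete CQL term: logsumexp over actions minus Q of the logged action."""
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[action]


def offline_loss(q_values: Sequence[float], action: int,
                 target: float, alpha: float) -> float:
    """Squared TD error plus alpha-weighted conservative penalty (assumed mix)."""
    td_error = q_values[action] - target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, action)


# Hypothetical numbers for one transition; reward stands in for a regional reward.
target = double_dqn_target(1.0, 0.9, [0.1, 0.5, 0.3], [0.2, 0.4, 0.9])
loss = offline_loss([0.0, 0.0], action=0, target=0.0, alpha=1.0)
```

The penalty is never negative. It shrinks as the logged action's Q-value dominates the others, which discourages the learner from overvaluing actions that rarely appear in the offline data.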
In practice
- Implement RL as a constrained control layer over existing optimizers.
- Use Double DQN with CQL to stabilize offline training.
- Monitor state and action distributions for policy and marketplace drift.
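Monitoring the action distribution can be sketched as follows. The example compares multiplier frequencies between a reference window and a live window using total variation distance. The metric, the 0.1 alert threshold, and the sample data are assumptions chosen for illustration; the summary does not say which drift statistic is used in production.

```python
from collections import Counter
from typing import List, Sequence

MULTIPLIERS = (0.8, 0.9, 1.0, 1.1, 1.2)


def action_distribution(actions: Sequence[float]) -> List[float]:
    """Empirical frequency of each multiplier in a window of logged actions."""
    if not actions:
        raise ValueError("need at least one logged action")
    counts = Counter(actions)
    return [counts[m] / len(actions) for m in MULTIPLIERS]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance: half the L1 gap between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical windows of logged multipliers.
reference = [1.0] * 50 + [1.1] * 30 + [0.9] * 20
live = [1.0] * 30 + [1.1] * 50 + [0.9] * 20

drift = total_variation(action_distribution(reference), action_distribution(live))
DRIFT_THRESHOLD = 0.1  # assumed alerting threshold
alert = drift > DRIFT_THRESHOLD
```

The same comparison can be applied to binned state features. A rising distance on states with a stable action distribution points to marketplace drift, while the reverse points to policy drift.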
Topics
- Multi-Agent Reinforcement Learning
- Offline Reinforcement Learning
- Marketplace Dispatch
- Objective-Weight Adaptation
- Double Q-learning
- Switchback Experiments
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.