Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Marketplace Optimization · Depth: Expert, quick

Summary

DoorDash has deployed a multi-agent reinforcement learning system that adapts dispatch objective weights in its large-scale three-sided food-delivery marketplace. The system learns from delayed operational feedback, such as delivery speed and courier utilization. Rather than replacing the existing combinatorial assignment optimizer, it uses a store-level policy to select a discrete multiplier that shifts the optimizer's balance between delivery quality and batching efficiency. This design allows policies to be learned offline from noisy, delayed, and coupled feedback while preserving production feasibility and operational safeguards. A shared value function is trained centrally on offline data and executed in a decentralized way at the store level; training uses Double Q-learning targets and a conservative regularizer to mitigate out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increased batching and reduced courier-side time costs without compromising customer-facing delivery quality.

Key takeaway

For MLOps engineers deploying AI in complex logistics or marketplace systems, the approach suggests a practical pattern: keep the existing optimizer and let a learned policy adjust its objective weights through multipliers, rather than replacing it outright. This preserves operational constraints while letting RL tune dynamic tradeoffs. Train conservatively offline, for example with Double Q-learning targets and a conservative regularizer, and validate with a production switchback experiment before relying on the policy, so that cost reductions do not come at the expense of service quality.
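
To make the validation step concrete, here is a minimal sketch of how a switchback experiment might be summarized. It assumes the common setup in which whole time windows are randomly assigned to the treatment policy or the control policy and a per-window metric is compared across arms. The function names, the window granularity, and the metric are hypothetical; the source does not describe DoorDash's actual experiment design or analysis.

```python
import random
from statistics import mean


def assign_switchback(num_windows, seed=0):
    """Randomly assign each time window to 'control' or 'treatment'.

    Hypothetical helper: real designs often randomize per region and
    window, and may balance the arms explicitly.
    """
    rng = random.Random(seed)
    return [rng.choice(["control", "treatment"]) for _ in range(num_windows)]


def switchback_effect(assignments, window_metrics):
    """Difference in mean per-window metric: treatment minus control."""
    treated = [m for a, m in zip(assignments, window_metrics) if a == "treatment"]
    control = [m for a, m in zip(assignments, window_metrics) if a == "control"]
    if not treated or not control:
        raise ValueError("need at least one window in each arm")
    return mean(treated) - mean(control)


# Illustrative numbers only (not results from the article):
# per-window courier-side time cost in minutes.
example_assignments = ["control", "treatment", "control", "treatment"]
example_metrics = [10.0, 8.0, 12.0, 9.0]
effect = switchback_effect(example_assignments, example_metrics)
```

A production analysis would also need variance estimates that respect the time-window structure and carryover between adjacent windows; this sketch only shows the point estimate.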

Key insights

DoorDash uses RL to adapt dispatch objective weights in a three-sided marketplace, balancing delivery quality and batching efficiency.
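
The idea of adapting a weight instead of replacing the optimizer can be illustrated with a toy objective. In this sketch the optimizer is reduced to choosing the lowest-cost candidate, and the policy's action is a multiplier on the batching term. The multiplier set, the cost form, and all names are assumptions made for illustration; the source does not give the actual objective or action values.

```python
# Hypothetical discrete action set for the store-level policy.
MULTIPLIERS = (0.5, 1.0, 1.5, 2.0)


def weighted_objective(quality_cost, batching_gain, base_weight, multiplier):
    """Toy dispatch cost: lower is better.

    A larger multiplier makes batching more attractive relative to
    delivery quality. The real optimizer's objective is not specified
    in the source.
    """
    return quality_cost - multiplier * base_weight * batching_gain


def best_assignment(candidates, base_weight, multiplier):
    """Stand-in for the combinatorial optimizer: pick the cheapest candidate."""
    return min(
        candidates,
        key=lambda c: weighted_objective(
            c["quality_cost"], c["batching_gain"], base_weight, multiplier
        ),
    )


candidates = [
    {"name": "solo", "quality_cost": 10.0, "batching_gain": 0.0},
    {"name": "batched", "quality_cost": 14.0, "batching_gain": 3.0},
]

# The same candidates yield different choices as the multiplier changes.
choice_neutral = best_assignment(candidates, base_weight=1.0, multiplier=1.0)
choice_batch_heavy = best_assignment(candidates, base_weight=1.0, multiplier=2.0)
```

Because the policy only changes a scalar inside the objective, any feasibility constraints enforced by the optimizer itself remain in force regardless of the action chosen.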

Method

A store-level policy selects a discrete multiplier for the dispatch optimizer. A shared value function is trained offline on centralized data with Double Q-learning targets and a conservative regularizer, then executed in a decentralized way at each store.
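
The two training ingredients named above can be sketched for a single logged transition. The Double Q-learning target selects the next action with the online network and evaluates it with the target network. For the conservative regularizer, the source does not say which form is used; this sketch swaps in a CQL-style log-sum-exp penalty as one common choice. The discount factor, the penalty weight `alpha`, and all function names are assumptions.

```python
import math


def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double Q-learning target for one transition.

    The action is chosen by the online Q-values but scored by the
    target Q-values, which reduces maximization bias.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def conservative_penalty(q_values, logged_action):
    """CQL-style penalty (an assumed form, not confirmed by the source).

    log-sum-exp over all actions minus the Q-value of the logged action.
    It is non-negative and shrinks when the logged action's value
    dominates the values of actions not seen in the data.
    """
    m = max(q_values)
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[logged_action]


def transition_loss(q_values, logged_action, target, alpha=1.0):
    """Squared TD error plus the weighted conservative penalty."""
    td_error = q_values[logged_action] - target
    return td_error ** 2 + alpha * conservative_penalty(q_values, logged_action)


# One Q-value per discrete multiplier; numbers are illustrative only.
target = double_q_target(
    reward=1.0,
    next_q_online=[0.2, 0.9, 0.1],
    next_q_target=[0.5, 0.3, 0.8],
    gamma=0.5,
)
```

At execution time, each store would evaluate the shared Q-function on its own state and pick the multiplier with the highest value, which is what makes execution decentralized while training stays centralized.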

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.