Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics) · Depth: Advanced, long

Summary

The paper presents OWA-RL, a reinforcement learning (RL) system deployed at DoorDash that adapts dispatch objective weights in a three-sided food-delivery marketplace. A store-level policy selects a discrete multiplier from {0.8, 0.9, 1.0, 1.1, 1.2} that shifts the existing combinatorial assignment optimizer's tradeoff between delivery quality (ASAP) and batching efficiency (XCAT). The policy is trained offline with Double Q-learning and a conservative regularizer on delayed regional rewards, and it is executed in a decentralized fashion at the store level. In a two-week production switchback experiment across approximately 4,000 geographic regions, the offline-trained policy increased batching by 0.495 percentage points and reduced courier-side time costs (CAT and CWT) without degrading customer-facing delivery quality (ASAP and 20-minute lateness). The system serves hundreds of millions of inferences per day at a 20-second cadence.
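To make the action space concrete, the following is a minimal sketch of greedy multiplier selection for a single store at a single decision tick. Only the multiplier set {0.8, 0.9, 1.0, 1.1, 1.2} comes from the summary. The function names, the made-up Q-values, and the assumption that the multiplier simply scales one objective weight before the optimizer runs are all hypothetical; the summary does not specify how the multiplier enters the optimizer's objective.

```python
from typing import Sequence

# Discrete action set reported in the summary.
MULTIPLIERS = (0.8, 0.9, 1.0, 1.1, 1.2)


def select_multiplier(q_values: Sequence[float]) -> float:
    """Greedy choice: return the multiplier with the highest estimated Q-value."""
    if len(q_values) != len(MULTIPLIERS):
        raise ValueError("expected one Q-value per multiplier")
    best_index = max(range(len(MULTIPLIERS)), key=lambda i: q_values[i])
    return MULTIPLIERS[best_index]


def adjusted_weight(base_weight: float, multiplier: float) -> float:
    """Hypothetical: scale one objective weight; the optimizer itself is untouched."""
    return base_weight * multiplier


# Illustrative values only: Q-value estimates for one store at one tick.
q_for_store = [0.10, 0.25, 0.30, 0.42, 0.15]
chosen = select_multiplier(q_for_store)      # picks the fourth action, 1.1
new_weight = adjusted_weight(2.0, chosen)    # weight handed to the optimizer
```

Because the policy only nudges a weight within a bounded range, the existing optimizer and its constraints still produce the actual assignments, which is what makes this a constrained control layer.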

Key takeaway

MLOps engineers deploying RL in complex logistics should consider integrating it as a constrained control layer over an existing optimizer. As DoorDash's OWA-RL system demonstrates, this approach allows adaptive objective-weight tuning while preserving operational safeguards. Offline training with conservative regularization, validated by robust online experimentation, can deliver efficiency gains such as increased batching and reduced courier time costs without compromising service quality.

Key insights

A deployed RL system adapts dispatch objective weights in a three-sided marketplace, improving efficiency without sacrificing delivery quality.

Method

The OWA-RL system uses a store-level policy to select a discrete multiplier for an existing combinatorial optimizer's objective. The policy is trained offline with Double DQN and Conservative Q-Learning (CQL) on delayed regional rewards.
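The training objective can be illustrated with a minimal per-transition sketch that combines a Double DQN target with the standard discrete-action CQL penalty (log-sum-exp of the Q-values minus the Q-value of the logged action). This is a textbook formulation under stated assumptions, not the paper's exact loss: the regularizer weight `alpha`, the discount `gamma`, and all function names are hypothetical, and the attribution of delayed regional rewards to store-level transitions is not modeled here.

```python
import math
from typing import Sequence


def double_dqn_target(reward: float, gamma: float,
                      next_q_online: Sequence[float],
                      next_q_target: Sequence[float],
                      done: bool = False) -> float:
    """Online network picks the next action; target network evaluates it."""
    if done:
        return reward
    best = max(range(len(next_q_online)), key=lambda i: next_q_online[i])
    return reward + gamma * next_q_target[best]


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a small action set."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values: Sequence[float], logged_action: int) -> float:
    """Pushes down Q-values overall while pushing up the logged action's value."""
    return logsumexp(q_values) - q_values[logged_action]


def conservative_loss(q_values: Sequence[float], logged_action: int,
                      target: float, alpha: float) -> float:
    """Squared TD error plus an alpha-weighted conservative penalty."""
    td_error = q_values[logged_action] - target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, logged_action)


# Illustrative numbers only: five actions, one per multiplier.
y = double_dqn_target(1.0, 0.9, [0.2, 0.5, 0.1, 0.0, 0.3], [1.0, 2.0, 3.0, 0.5, 0.7])
loss = conservative_loss([0.4, 0.6, 0.2, 0.1, 0.3], logged_action=1, target=y, alpha=1.0)
```

The penalty is always non-negative and grows when actions absent from the logged data receive high Q-values, which discourages the offline policy from preferring multipliers it has little evidence for.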

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.