Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Advanced, long

Summary

The paper presents OWA-RL, a reinforcement learning (RL) system deployed at DoorDash that adapts dispatch objective weights in a three-sided food-delivery marketplace. A store-level policy selects a discrete multiplier from {0.8, 0.9, 1.0, 1.1, 1.2} that shifts the existing combinatorial assignment optimizer's tradeoff between delivery quality (ASAP) and batching efficiency (XCAT). The policy is trained offline using Double Q-learning with a conservative regularizer on delayed regional rewards, and it is executed in a decentralized fashion at the store level. In a two-week production switchback experiment across approximately 4,000 geographic regions, the offline-trained policy increased batching by 0.495 percentage points and reduced courier-side time costs (CAT and CWT) without degrading customer-facing delivery quality (ASAP and 20-minute lateness). The system serves hundreds of millions of inferences per day at a 20-second cadence.
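
The multiplier mechanism described above can be illustrated with a minimal sketch. The five multiplier values come from the summary; everything else is an assumption made for illustration, including the function names, the greedy selection rule, and the two-term objective in which the multiplier scales the weight on the efficiency term. The paper's actual optimizer objective is not given here.

```python
from typing import Sequence

# The five discrete multipliers listed in the summary.
MULTIPLIERS = (0.8, 0.9, 1.0, 1.1, 1.2)


def select_multiplier(q_values: Sequence[float]) -> float:
    """Greedy action: return the multiplier with the highest estimated Q-value.

    `q_values` holds one estimate per multiplier, in the same order as
    MULTIPLIERS. Ties resolve to the first maximum.
    """
    if len(q_values) != len(MULTIPLIERS):
        raise ValueError("expected one Q-value per multiplier")
    best = max(range(len(MULTIPLIERS)), key=lambda i: q_values[i])
    return MULTIPLIERS[best]


def adjusted_objective(
    quality_cost: float,
    efficiency_cost: float,
    base_weight: float,
    multiplier: float,
) -> float:
    """Hypothetical two-term objective (an assumption, not the paper's form).

    The multiplier rescales the baseline weight on the efficiency term, so
    1.0 reproduces the unmodified optimizer and other values shift the
    tradeoff by at most 20 percent in either direction.
    """
    return quality_cost + multiplier * base_weight * efficiency_cost
```

Because the action only rescales a weight inside the existing optimizer, the optimizer's own constraints still apply to every assignment, which is what makes this a constrained control layer rather than a replacement.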

Key takeaway

MLOps engineers deploying RL in complex logistics systems should consider integrating RL as a constrained control layer over an existing optimizer. DoorDash's OWA-RL system shows that this approach allows adaptive objective-weight tuning while preserving operational safeguards. Offline training with conservative regularization, validated through online switchback experimentation, can deliver efficiency gains such as increased batching and reduced courier costs without compromising service quality.

Key insights

A deployed RL system adapts dispatch objective weights in a three-sided marketplace, improving efficiency without sacrificing delivery quality.

Principles

Modulating existing optimizers with RL preserves operational safeguards.
Offline RL with conservative regularization enhances training stability.
Regional reward aggregation captures marketplace network effects.
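
One way to picture regional reward aggregation is the sketch below. It assumes that per-delivery outcome scores arrive after a delay, that they are averaged per region and time window, and that every store-level transition in that region and window receives the same shared reward. The field names (`region_id`, `window`), the use of a mean, and both helper functions are hypothetical; the source does not specify the reward formula.

```python
from collections import defaultdict
from statistics import fmean


def regional_rewards(outcomes):
    """Average delayed outcome scores per (region_id, window).

    `outcomes` is an iterable of (region_id, window, score) tuples.
    Using the mean is an assumption made for this sketch.
    """
    buckets = defaultdict(list)
    for region_id, window, score in outcomes:
        buckets[(region_id, window)].append(score)
    return {key: fmean(scores) for key, scores in buckets.items()}


def attach_rewards(transitions, rewards):
    """Label store-level transitions with their region's shared reward.

    Transitions whose regional reward has not arrived yet are skipped,
    so they can be labeled in a later pass.
    """
    labeled = []
    for t in transitions:
        key = (t["region_id"], t["window"])
        if key in rewards:
            labeled.append({**t, "reward": rewards[key]})
    return labeled
```

Sharing one reward across all stores in a region means a store's action is credited for its effect on neighboring stores that draw on the same courier pool, which is the network effect the principle refers to.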

Method

The OWA-RL system uses a store-level policy to select a discrete multiplier that is applied to an existing combinatorial optimizer's objective. The policy is trained offline with Double DQN and a Conservative Q-Learning (CQL) regularizer, using delayed regional rewards.
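
The training objective can be sketched for a single transition as follows. This uses the standard Double DQN target (the online network picks the next action, the target network evaluates it) and the common discrete-action CQL penalty (log-sum-exp of the Q-values minus the Q-value of the logged action). The squared TD loss, the coefficient `alpha`, and all function names are assumptions for illustration; the paper's exact loss and hyperparameters are not stated here.

```python
import math
from typing import Sequence


def double_dqn_target(
    reward: float,
    gamma: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    done: bool,
) -> float:
    """Double DQN target: select with the online net, evaluate with the target net."""
    if done:
        return reward
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[a_star]


def cql_penalty(q_values: Sequence[float], logged_action: int) -> float:
    """Discrete CQL penalty: logsumexp(Q(s, .)) - Q(s, a_logged).

    Always non-negative; it pushes down Q-values of actions that the
    logged data does not support.
    """
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[logged_action]


def transition_loss(
    q_values: Sequence[float],
    logged_action: int,
    target: float,
    alpha: float,
) -> float:
    """Half squared TD error plus alpha times the CQL penalty (alpha is hypothetical)."""
    td_error = q_values[logged_action] - target
    return 0.5 * td_error * td_error + alpha * cql_penalty(q_values, logged_action)
```

In a real training loop these quantities would be computed in batches with an autodiff framework; the scalar form is only meant to make the two terms easy to inspect.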

In practice

Implement RL as a constrained control layer over existing optimizers.
Use Double DQN with Conservative Q-Learning (CQL) to stabilize offline training.
Monitor state and action distributions for policy and marketplace drift.
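
As one possible way to monitor action-distribution drift, the sketch below compares the multiplier frequencies in a recent window against a baseline window using total variation distance. The choice of metric, the function names, and the alert threshold in the usage lines are assumptions; the source does not say which drift statistics the deployed system tracks.

```python
from collections import Counter
from typing import Dict, Sequence

MULTIPLIERS = (0.8, 0.9, 1.0, 1.1, 1.2)


def action_frequencies(
    actions: Sequence[float],
    support: Sequence[float] = MULTIPLIERS,
) -> Dict[float, float]:
    """Empirical frequency of each action in `support`.

    Actions outside `support` are counted in the total but not reported,
    so the returned frequencies can sum to less than 1 in that case.
    """
    if not actions:
        raise ValueError("need at least one action")
    counts = Counter(actions)
    total = len(actions)
    return {a: counts[a] / total for a in support}


def total_variation(p: Dict[float, float], q: Dict[float, float]) -> float:
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)


# Usage: flag a window whose action mix moved far from the baseline.
baseline = action_frequencies([1.0, 1.0, 1.0, 1.0])
current = action_frequencies([1.0, 1.0, 1.2, 1.2])
DRIFT_THRESHOLD = 0.2  # hypothetical alert level
drifted = total_variation(baseline, current) > DRIFT_THRESHOLD
```

The same comparison can be applied to binned state features to separate policy drift (the action mix changes under similar states) from marketplace drift (the states themselves change).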

Topics

Multi-Agent Reinforcement Learning
Offline Reinforcement Learning
Marketplace Dispatch
Objective-Weight Adaptation
Double Q-learning
Switchback Experiments

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.