Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

2026-06-11 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

DoorDash has deployed a multi-agent reinforcement learning (MARL) system to adapt dispatch objective weights within its three-sided food-delivery marketplace. This system addresses the challenge of delayed operational feedback, such as delivery speed and courier utilization, by learning a store-level policy from logged marketplace data. Instead of replacing the existing combinatorial assignment optimizer, the policy selects a discrete multiplier that adjusts the optimizer's balance between delivery quality and batching efficiency. This approach facilitates offline policy learning under noisy, delayed, and coupled feedback while maintaining production feasibility and operational safeguards. The system trains a shared value function using centralized offline data and decentralized store-level execution, incorporating Double Q-learning targets and a conservative regularizer. A production switchback experiment confirmed the offline-trained policy successfully increased batching and reduced courier-side time costs without compromising customer-facing delivery quality.

Key takeaway

For MLOps engineers managing logistics or marketplace dispatch systems, this work demonstrates a production-tested strategy for offline policy learning. Consider using reinforcement learning to adapt the objective weights of an existing optimizer rather than attempting to replace it. In the reported switchback experiment, this approach increased batching and reduced courier-side time costs using delayed marketplace feedback, without degrading customer-facing delivery quality.

Key insights

Reinforcement learning can safely adapt dispatch objective weights in complex, real-world marketplaces using delayed feedback.

Principles

Adapt existing optimizers, don't replace them.
Use discrete multipliers as the policy-optimizer interface.
Train on centralized offline data; execute per store.
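
The first two principles can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the multiplier grid, the `base_weight`, and the linear cost form are hypothetical and are not taken from the paper, which only states that the policy picks a discrete multiplier that shifts the optimizer's balance between delivery quality and batching efficiency.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical discrete action set: multipliers applied to the batching weight.
MULTIPLIERS: Sequence[float] = (0.5, 0.75, 1.0, 1.25, 1.5)


def select_multiplier(q_values: Sequence[float],
                      multipliers: Sequence[float] = MULTIPLIERS) -> float:
    """Greedy store-level action: the multiplier with the highest estimated value."""
    if len(q_values) != len(multipliers):
        raise ValueError("one Q-value per multiplier is required")
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return multipliers[best_index]


def assignment_cost(quality_cost: float, batching_saving: float,
                    base_weight: float, multiplier: float) -> float:
    """Assumed scalar cost the existing optimizer minimizes for one candidate.

    The policy only scales the batching weight; the optimizer itself is unchanged.
    """
    return quality_cost - base_weight * multiplier * batching_saving


def choose_assignment(candidates: Dict[str, Tuple[float, float]],
                      base_weight: float, multiplier: float) -> str:
    """Stand-in for the combinatorial optimizer: pick the lowest-cost candidate."""
    return min(
        candidates,
        key=lambda name: assignment_cost(*candidates[name], base_weight, multiplier),
    )


# Two toy candidates: (quality_cost, batching_saving).
candidates = {"single": (10.0, 0.0), "batched": (14.0, 5.0)}
low = choose_assignment(candidates, base_weight=0.6, multiplier=0.5)   # "single"
high = choose_assignment(candidates, base_weight=0.6, multiplier=1.5)  # "batched"
```

In this toy setup the optimizer's search is untouched; only the weight it is handed changes, which is why a small discrete action set is enough to move the outcome from unbatched to batched.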

Method

A store-level policy learns from logged marketplace data to select a discrete multiplier that shifts the existing combinatorial optimizer's tradeoff between delivery quality and batching efficiency. Training uses a shared value function with Double Q-learning targets and a conservative regularizer.
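
As a rough sketch of how these pieces fit together, the code below computes a Double Q-learning target and adds a conservative penalty to a per-sample loss. The summary does not specify the form of the regularizer, so a CQL-style log-sum-exp penalty is swapped in here as one common choice. The function names, `gamma`, and `alpha` are hypothetical.

```python
import math
from typing import Sequence


def double_q_target(reward: float,
                    next_q_online: Sequence[float],
                    next_q_target: Sequence[float],
                    gamma: float = 0.99,
                    done: bool = False) -> float:
    """Double Q-learning target.

    The online network selects the next action; the target network evaluates it.
    Decoupling selection from evaluation reduces overestimation bias.
    """
    if done:
        return reward
    best_next = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_next]


def conservative_penalty(q_values: Sequence[float], logged_action: int) -> float:
    """Assumed CQL-style penalty: logsumexp over all actions minus Q of the logged action.

    Always non-negative; it grows when actions other than the logged one
    receive high value estimates.
    """
    m = max(q_values)
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[logged_action]


def sample_loss(q_values: Sequence[float], logged_action: int,
                target: float, alpha: float = 1.0) -> float:
    """Squared TD error on the logged action plus the weighted conservative penalty."""
    td_error = q_values[logged_action] - target
    return 0.5 * td_error ** 2 + alpha * conservative_penalty(q_values, logged_action)


# One logged transition for a store with three candidate multipliers.
target = double_q_target(reward=1.0,
                         next_q_online=[0.2, 0.9, 0.1],
                         next_q_target=[0.5, 0.3, 0.8],
                         gamma=0.9)
loss = sample_loss(q_values=[0.4, 1.0, 0.6], logged_action=1, target=target)
```

In the example, the online estimates pick action 1, so the target uses the target network's value for that action (0.3) rather than its own maximum (0.8). In a shared-value-function setup, the same loss would be averaged over logged transitions pooled from all stores.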

In practice

Integrate RL with existing optimization systems.
Apply Double Q-learning for value estimation.
Use conservative regularization to limit overestimated values for out-of-distribution (OOD) actions.

Topics

Multi-Agent Reinforcement Learning
Three-Sided Marketplaces
Dispatch Optimization
Objective Weight Adaptation
Offline Reinforcement Learning
DoorDash

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.