Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

This paper introduces Decentralized Diffusion Policy Learning (DDPL), an approach to cooperative multi-agent reinforcement learning (MARL) that improves exploration by parameterizing agent policies with denoising diffusion probabilistic models (DDPMs). The authors show that the Gaussian policies commonly used in decentralized softmax policy gradient (DecSPG) algorithms severely limit exploration and lead to suboptimal equilibria, a problem that worsens as the number of agents grows. DDPL addresses this by training diffusion policies online with importance sampling score matching (ISSM), a new method that comes with theoretical guarantees. On continuous-action MARL benchmarks, including the multi-agent particle environment (MPE), multi-agent MuJoCo (MaMuJoCo), IsaacLab, and SMAX (a JAX reimplementation of the StarCraft Multi-Agent Challenge), DDPL shows consistent performance improvements and better sample efficiency than baselines such as HAPPO, MAPPO, and HASAC.

Key takeaway

For research scientists developing cooperative MARL systems, DDPL targets the exploration problem directly. Consider parameterizing policies with diffusion models: they can represent the multi-modal action distributions needed to discover good coordination patterns, which a single Gaussian cannot. Combined with ISSM for efficient online training, this can prevent premature convergence to suboptimal equilibria, particularly in environments with many agents, and the paper reports gains in both performance and sample efficiency over existing baselines.

Key insights

Diffusion policies enable multi-modal exploration in MARL, outperforming restrictive Gaussian policies.
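Why a single Gaussian is restrictive can be seen in a toy one-dimensional case. The sketch below is an illustration only, not an experiment from the paper: it assumes a hypothetical action value `q_value` with two equally good modes, forms the energy-based target proportional to exp(Q/temperature), and fits a Gaussian by moment matching (one possible fitting rule, chosen here for simplicity). The fitted Gaussian peaks in the low-value valley between the modes and underweights both modes.

```python
import math

def q_value(a):
    # Hypothetical 1-D action value with two equally good modes at a = -1 and a = +1.
    return -(a * a - 1.0) ** 2

def fit_gaussian_to_target(temperature=0.5, lo=-3.0, hi=3.0, n=6001):
    """Moment-match a Gaussian to the density proportional to exp(Q/temperature)."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    unnorm = [math.exp(q_value(a) / temperature) for a in grid]
    z = sum(unnorm) * step  # Riemann-sum normalizer on [lo, hi]
    target = [u / z for u in unnorm]
    mean = sum(a * p for a, p in zip(grid, target)) * step
    var = sum((a - mean) ** 2 * p for a, p in zip(grid, target)) * step

    def target_pdf(a):
        return math.exp(q_value(a) / temperature) / z

    def gaussian_pdf(a):
        return math.exp(-(a - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

    return mean, var, target_pdf, gaussian_pdf

mean, var, target_pdf, gaussian_pdf = fit_gaussian_to_target()
for a in (-1.0, 0.0, 1.0):
    print(f"a={a:+.1f}  target={target_pdf(a):.3f}  gaussian={gaussian_pdf(a):.3f}")
```

A policy class that can represent both modes, such as a diffusion model, does not face this trade-off between covering the modes and avoiding the valley.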

Method

DDPL parameterizes decentralized policies with DDPMs and trains them online using importance sampling score matching (ISSM), which moves diffusion policies toward energy-based target distributions without requiring target samples.
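The idea of obtaining a score for an energy-based target without sampling from that target can be sketched in one dimension. The snippet below is an illustrative assumption, not the paper's ISSM objective: it takes a target proportional to exp(Q(a)/temperature), draws clean actions from a uniform proposal, and uses self-normalized importance weights to estimate the score of the noised target at a point `a_t`, under the standard DDPM forward process a_t = sqrt(alpha_bar) * a_0 + sqrt(1 - alpha_bar) * eps. The names `q_value`, `is_score_estimate`, `alpha_bar`, and `temperature` are hypothetical.

```python
import math
import random

def q_value(a):
    # Hypothetical 1-D action value with two equally good modes at a = -1 and a = +1.
    return -(a * a - 1.0) ** 2

def is_score_estimate(a_t, alpha_bar, temperature, n_samples, rng):
    """Self-normalized importance-sampling estimate of d/da_t log p_t(a_t).

    p_t is the target exp(Q/temperature) after DDPM noising at level alpha_bar.
    Assumption: the target mass outside [-3, 3] is negligible for this q_value.
    """
    var = 1.0 - alpha_bar
    scale = math.sqrt(alpha_bar)
    log_w, cond_scores = [], []
    for _ in range(n_samples):
        # Uniform proposal: its density is constant, so it cancels after normalization.
        a0 = rng.uniform(-3.0, 3.0)
        # log weight = log target(a0) + log N(a_t; scale * a0, var), up to constants
        log_w.append(q_value(a0) / temperature - (a_t - scale * a0) ** 2 / (2.0 * var))
        # Score of the Gaussian noising kernel with respect to a_t
        cond_scores.append(-(a_t - scale * a0) / var)
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    return sum(wi * si for wi, si in zip(w, cond_scores)) / sum(w)

rng = random.Random(0)
for a_t in (-2.5, 0.0, 2.5):
    est = is_score_estimate(a_t, alpha_bar=0.5, temperature=0.5, n_samples=20000, rng=rng)
    print(f"a_t={a_t:+.1f}  estimated score={est:+.3f}")
```

In a training loop, an estimate of this kind could serve as the regression target for a score (or noise-prediction) network at each noise level. The proposal, sample count, and loss weighting are design choices this sketch does not settle.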

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.