Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward
Summary
This research introduces a Maximum Entropy Inverse Reinforcement Learning (IRL) framework for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. The method recovers an unknown reward function from expert demonstrations, assuming they arise from a stationary mean-field equilibrium. It formulates the inverse problem by enforcing consistency with the expert mean-field term and long-run feature expectations, using a unified occupation-measure framework for two reward classes. For finite-dimensional linear rewards, a convex dual reformulation with an explicit log-partition objective supports constant-step-size gradient descent. For infinite-dimensional RKHS rewards, a Lagrangian relaxation yields a policy defined by a soft Bellman equation. Because the average-reward setting has no discount factor to supply a contraction, the soft Bellman operator is defined with a minorisation-based sub-stochastic kernel, which makes it a strict contraction. The framework establishes Fréchet differentiability and Lipschitz smoothness for the log-likelihood score, enabling a gradient ascent algorithm with convergence guarantees. Numerical examples, including a malware-spread MFG, indicate that the recovered policies match expert behavior.
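To make the linear-reward case concrete, here is a minimal sketch of a log-partition dual minimised by constant-step-size gradient descent. It is the textbook maximum-entropy feature-matching dual over a finite set of outcomes, not the paper's objective: it ignores the mean-field and occupation-measure consistency constraints, and the names `gibbs` and `fit_dual` as well as the toy features are assumptions made for illustration.

```python
import math

def gibbs(theta, feats):
    """Distribution p_theta(z) proportional to exp(<theta, f(z)>) on a finite set."""
    scores = [sum(t * f for t, f in zip(theta, fz)) for fz in feats]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def fit_dual(feats, f_expert, n_steps=5000):
    """Minimise g(theta) = log Z(theta) - <theta, f_expert> by gradient descent.

    The gradient is E_{p_theta}[f] - f_expert. The Hessian is the covariance
    of f under p_theta, so L = max_z ||f(z)||^2 bounds the smoothness constant
    and 1/L is a safe constant step size.
    """
    dim = len(f_expert)
    smoothness = max(sum(f * f for f in fz) for fz in feats)
    step = 1.0 / smoothness
    theta = [0.0] * dim
    for _ in range(n_steps):
        p = gibbs(theta, feats)
        model = [sum(p[z] * feats[z][k] for z in range(len(feats))) for k in range(dim)]
        theta = [theta[k] - step * (model[k] - f_expert[k]) for k in range(dim)]
    return theta

# Toy data (invented): four outcomes with two binary features each.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
expert_dist = [0.4, 0.3, 0.2, 0.1]
f_expert = [sum(expert_dist[z] * feats[z][k] for z in range(4)) for k in range(2)]

theta = fit_dual(feats, f_expert)
p_fit = gibbs(theta, feats)
model_feats = [sum(p_fit[z] * feats[z][k] for z in range(4)) for k in range(2)]
print("theta:", theta)
print("model feature expectations:", model_feats, "expert:", f_expert)
```

At the minimiser the model's feature expectations equal the expert's, which is the feature-matching condition that the dual encodes.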
Key takeaway
For machine learning engineers building multi-agent systems, this research offers a principled way to infer reward structures from observed collective behavior. You can apply this Maximum Entropy IRL framework to mean-field game dynamics under an average-reward criterion, where no discount factor is available. Consider implementing the proposed gradient ascent algorithm for systems such as malware propagation or consumer choice models to recover policies consistent with expert behavior.
Key insights
Recovering unknown reward functions in average-reward MFGs is feasible using maximum causal entropy and a minorisation-based contraction argument.
Principles
- Max causal entropy explains expert behavior.
- Consistency with mean-field terms is key.
- Sub-stochastic kernels enable contraction.
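The last principle can be illustrated with a standard argument, written here for finite state and action spaces with the mean-field term fixed at the expert's value. The notation (transition kernel p, reward r, minorising measure ν, constant ε) is an assumption for this sketch and may differ from the paper's.

```latex
\begin{align*}
&\text{Minorisation (assumed): } p(y \mid x,a,\mu) \ge \varepsilon\,\nu(y)
  \quad \text{for all } (x,a,y),\ \varepsilon \in (0,1), \\
&q(y \mid x,a) := p(y \mid x,a,\mu) - \varepsilon\,\nu(y),
  \qquad \sum_y q(y \mid x,a) = 1-\varepsilon, \\
&(Th)(x) := \log \sum_a \exp\Big( r(x,a) + \sum_y q(y \mid x,a)\,h(y) \Big), \\
&\|Th_1 - Th_2\|_\infty
  \le \max_{x,a} \Big| \sum_y q(y \mid x,a)\,\big(h_1(y)-h_2(y)\big) \Big|
  \le (1-\varepsilon)\,\|h_1-h_2\|_\infty .
\end{align*}
```

The first inequality uses the fact that log-sum-exp is 1-Lipschitz in the sup norm; the second uses the total mass 1 − ε of q. The factor 1 − ε plays the role that a discount factor would play in the discounted setting.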
Method
The method formulates an inverse problem enforcing consistency with expert mean-field terms and long-run feature expectations. It uses a convex dual for linear rewards and a Lagrangian relaxation with a minorisation-based sub-stochastic kernel for RKHS rewards, followed by gradient ascent.
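A minimal sketch of the soft Bellman step may help here. It assumes finite state and action spaces, a transition kernel already evaluated at the expert mean-field term, and a minorisation condition `p[x][a][y] >= eps * nu[y]`; the function name `soft_bellman_fixed_point` and the toy numbers are invented for illustration and are not taken from the paper.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def soft_bellman_fixed_point(r, p, eps, nu, tol=1e-10, max_iter=10_000):
    """Iterate h <- log sum_a exp(r(x,a) + sum_y q(y|x,a) h(y)) to a fixed point.

    r[x][a] is the reward, p[x][a][y] the transition kernel at the fixed
    mean-field term, and q = p - eps * nu the sub-stochastic kernel.
    Returns the relative value h, the gain rho, and the soft policy.
    """
    states, actions = range(len(r)), range(len(r[0]))
    q = [[[p[x][a][y] - eps * nu[y] for y in states] for a in actions] for x in states]
    if min(v for qx in q for qxa in qx for v in qxa) < -1e-12:
        raise ValueError("minorisation condition p >= eps * nu is violated")

    def action_values(h):
        return [[r[x][a] + sum(q[x][a][y] * h[y] for y in states) for a in actions]
                for x in states]

    h = [0.0] * len(r)
    for _ in range(max_iter):
        h_new = [logsumexp(row) for row in action_values(h)]
        gap = max(abs(h_new[x] - h[x]) for x in states)
        h = h_new
        if gap < tol:  # contraction modulus is 1 - eps, so this is reached
            break

    vals = action_values(h)
    # Soft (Boltzmann) policy: pi(a|x) = exp(Q(x,a) - logsumexp_a Q(x,a)).
    policy = [[math.exp(v - logsumexp(row)) for v in row] for row in vals]
    # At the fixed point, h + eps * <nu, h> = logsumexp(r + P h), so the gain is:
    rho = eps * sum(nu[y] * h[y] for y in states)
    return h, rho, policy

# Toy two-state, two-action model (invented numbers).
r = [[1.0, 0.0], [0.0, 0.5]]
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.4, 0.6]]]
h, rho, policy = soft_bellman_fixed_point(r, p, eps=0.2, nu=[0.5, 0.5])
print("relative value h:", h)
print("gain rho:", rho)
print("soft policy:", policy)
```

The returned pair (h, rho) satisfies the average-reward soft Bellman equation h(x) + rho = log Σ_a exp(r(x,a) + Σ_y p(y|x,a) h(y)) up to the tolerance. In a full IRL loop, the reward `r` would be updated by gradient ascent on the log-likelihood between calls to this solver.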
In practice
- Model malware spread dynamics.
- Analyze consumer choice patterns.
- Recover reward functions from observed multi-agent behavior.
Topics
- Inverse Reinforcement Learning
- Mean-Field Games
- Maximum Entropy Principle
- Average Reward Criterion
- Soft Bellman Equation
- Gradient Ascent Algorithms
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.