Scalar-Stepsize Nonuniform Monte Carlo Optimistic Policy Iteration: A Certified Counterexample
Summary
A new paper presents a certified counterexample to the convergence of Monte Carlo optimistic policy iteration (MCOPI) when using scalar-stepsize nonuniform updates. While Tsitsiklis previously proved MCOPI convergence under uniform update structures, this research demonstrates that fixed nonuniform state-selection probabilities can prevent convergence. The counterexample, set in a three-state, two-action discounted Markov Decision Process (MDP), shows that nonuniform update frequencies induce a diagonally scaled greedy-policy mean field. This leads to a nonconstant attracting hybrid periodic orbit, or cycle. Consequently, with a bounded unbiased geometric-horizon estimator and Robbins--Monro stepsizes, the original stochastic recursion remains trapped near this cycle with positive probability, failing to converge. The study pinpoints a geometric obstruction: uniform sampling ensures radial residual contraction, but scalar nonuniform sampling anisotropically distorts residual dynamics, generating switched attracting cycles.
Key takeaway
For AI scientists developing or applying Monte Carlo optimistic policy iteration, recognize that employing scalar-stepsize nonuniform state-selection probabilities can lead to non-convergence. Your implementations should carefully consider the implications of update frequency distribution, as anisotropic distortion of residual dynamics can trap the system in periodic orbits. Prioritize uniform sampling or investigate alternative convergence guarantees if nonuniform updates are essential for your specific MDP.
Key insights
Nonuniform state-selection probabilities can prevent Monte Carlo optimistic policy iteration from converging by inducing attracting cycles.
Principles
- Uniform sampling ensures radial residual contraction.
- Nonuniform sampling distorts residual dynamics anisotropically.
- Scalar-stepsize nonuniform updates can generate switched attracting cycles.
Method
The paper constructs a three-state, two-action discounted MDP where unnormalized asynchronous state-value recursion with fixed nonuniform state-selection probabilities leads to a nonconstant attracting hybrid periodic orbit, preventing convergence.
Topics
- Monte Carlo Policy Iteration
- Reinforcement Learning
- Markov Decision Processes
- Convergence Theory
- Nonuniform Sampling
- Stochastic Recursion
Best for: Research Scientist, AI Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.