Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
Summary
This paper introduces high-order generator regression for continuous-time policy evaluation from discrete, closed-loop trajectories under time-inhomogeneous dynamics. The method addresses the first-order limitation of the traditional Bellman baseline, which is derived from a one-step recursion and exhibits an $O(\Delta t)$ error. By estimating the time-dependent generator from multi-step transitions using moment-matching coefficients, the proposed approach cancels lower-order truncation terms, achieving higher-order accuracy (e.g., $O(\Delta t^2)$ for Gen2, $O(\Delta t^3)$ for Gen3). The authors provide an end-to-end theoretical decomposition of error into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error. Empirical studies across various benchmarks (2-dimensional pendulum, 4-dimensional coupled regulator, 12- and 24-dimensional networked linear-quadratic systems) demonstrate that the second-order estimator (Gen2) consistently improves upon the Bellman baseline, reducing integrated RMSE by 13% to 48%, particularly in regimes where theory predicts visible gains.
Key takeaway
Machine Learning Engineers evaluating continuous-time policies from discrete trajectories should consider high-order generator regression, specifically Gen2, which the paper reports to be more accurate than the standard Bellman baseline. The gain comes from reduced discretization error and shows up mainly in non-stationary environments where the nonstationarity floor is low enough for second-order improvements to be visible. The feature approximation class must also be rich enough to realize these higher-order improvements.
Key insights
High-order generator regression improves continuous-time policy evaluation by canceling lower-order discretization errors.
Principles
- The Bellman baseline is first-order in the grid width $\Delta t$.
- Multi-step moment matching yields higher-order generator surrogates.
- Whether higher-order gains are visible depends on the decision-frequency regime.
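A short worked expansion may help show why multi-step combinations cancel lower-order terms. It is a sketch under simplifying assumptions that are not taken from the paper: a time-homogeneous generator $L$, a smooth test function $f$, the notation $P_s f(x) = \mathbb{E}[f(X_{t+s}) \mid X_t = x]$, and weights $a_0, \dots, a_i$ on the $0$- to $i$-step transitions (with the convention $0^0 = 1$). The paper's time-inhomogeneous construction may differ in detail.

$$
P_{j\Delta t} f = \sum_{k=0}^{i} \frac{(j\Delta t)^k}{k!} L^k f + O(\Delta t^{i+1}),
\qquad
\frac{1}{\Delta t}\sum_{j=0}^{i} a_j P_{j\Delta t} f
= \sum_{k=0}^{i} \frac{\Delta t^{k-1}}{k!}\Big(\sum_{j=0}^{i} a_j j^k\Big) L^k f + O(\Delta t^{i}).
$$

Choosing the weights so that $\sum_j a_j = 0$, $\sum_j a_j j = 1$, and $\sum_j a_j j^k = 0$ for $2 \le k \le i$ leaves $Lf + O(\Delta t^i)$. Under these assumptions the first two cases are:

$$
i = 1:\ \frac{P_{\Delta t} f - f}{\Delta t} = Lf + O(\Delta t),
\qquad
i = 2:\ \frac{-3f + 4P_{\Delta t} f - P_{2\Delta t} f}{2\Delta t} = Lf + O(\Delta t^2).
$$

The $i = 1$ case uses only one-step transitions, which matches the first-order behavior attributed to the Bellman baseline; the $i = 2$ case adds a two-step transition to cancel the $O(\Delta t)$ term.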
Method
Estimate the time-dependent generator from multi-step transitions using moment-matching coefficients, then combine it with backward regression to solve the parabolic value equation. The order-$i$ estimator (Gen$i$) achieves $O(\Delta t^i)$ accuracy, e.g., $i = 2$ for Gen2 and $i = 3$ for Gen3.
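The moment-matching step can be illustrated with a minimal sketch. It assumes forward multi-step weights $a_0, \dots, a_i$ that solve $\sum_j a_j j^k = \mathbf{1}[k = 1]$ for $k = 0, \dots, i$; the paper's exact coefficients and its treatment of time-dependent generators may differ. The helper names (`moment_matching_coeffs`, `generator_estimate`, `toy_error`) are hypothetical, and the check uses a time-homogeneous linear-drift toy process with known conditional means rather than any benchmark from the paper. The backward-regression step is not shown.

```python
import numpy as np


def moment_matching_coeffs(order):
    """Solve sum_j a_j * j**k = 1 if k == 1 else 0, for k = 0..order, over steps j = 0..order."""
    steps = np.arange(order + 1, dtype=float)
    # Row k of the linear system holds j**k for each step count j.
    system = np.vander(steps, N=order + 1, increasing=True).T
    rhs = np.zeros(order + 1)
    rhs[1] = 1.0
    return np.linalg.solve(system, rhs)


def generator_estimate(cond_means, dt):
    """Combine E[f(X_{t + j*dt}) | X_t = x], j = 0..order, into an estimate of (L f)(x)."""
    cond_means = np.asarray(cond_means, dtype=float)
    coeffs = moment_matching_coeffs(len(cond_means) - 1)
    return float(coeffs @ cond_means) / dt


def toy_error(order, dt, theta=1.0, x0=1.0):
    """Absolute error on a toy process with drift -theta * x and f(x) = x.

    Assumed closed forms: E[f(X_{t+s}) | X_t = x0] = x0 * exp(-theta * s)
    and (L f)(x0) = -theta * x0.
    """
    lags = dt * np.arange(order + 1)
    cond_means = x0 * np.exp(-theta * lags)
    return abs(generator_estimate(cond_means, dt) - (-theta * x0))


# Halving dt should shrink the error by roughly 2**order.
for order in (1, 2, 3):
    ratio = toy_error(order, 0.1) / toy_error(order, 0.05)
    print(order, moment_matching_coeffs(order), ratio)
```

In this sketch the exact conditional means stand in for what would be regression estimates in practice, so the printed ratios isolate the truncation order and say nothing about finite-sample or projection error.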
In practice
- Gen2 consistently outperforms the Bellman baseline.
- Richer feature classes are needed for high-order gains.
- Gen2 is often the safest practical choice.
Topics
- Continuous-Time Policy Evaluation
- Generator Regression
- Bellman Baseline
- Multi-step Moment Matching
- Error Decomposition
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.