Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation

2026-04-22 · Source: stat.ML updates on arXiv.org · Field: Science & Research — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper introduces high-order generator regression for continuous-time policy evaluation from discrete, closed-loop trajectories under time-inhomogeneous dynamics. The method addresses the first-order limitation of the traditional Bellman baseline, which is derived from a one-step recursion and exhibits an $O(\Delta t)$ error. By estimating the time-dependent generator from multi-step transitions using moment-matching coefficients, the proposed approach cancels lower-order truncation terms, achieving higher-order accuracy (e.g., $O(\Delta t^2)$ for Gen2, $O(\Delta t^3)$ for Gen3). The authors provide an end-to-end theoretical decomposition of error into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error. Empirical studies across various benchmarks (2-dimensional pendulum, 4-dimensional coupled regulator, 12- and 24-dimensional networked linear-quadratic systems) demonstrate that the second-order estimator (Gen2) consistently improves upon the Bellman baseline, reducing integrated RMSE by 13% to 48%, particularly in regimes where theory predicts visible gains.

Key takeaway

Machine learning engineers evaluating continuous-time policies from discrete trajectories should consider high-order generator regression, specifically Gen2, which the paper reports reduces integrated RMSE by 13% to 48% relative to the standard Bellman baseline on its benchmarks. The gain comes from reduced discretization error and is most visible in non-stationary environments where the nonstationarity floor is low enough for second-order gains to show. Realizing these higher-order improvements requires a sufficiently rich feature approximation class.

Key insights

High-order generator regression improves continuous-time policy evaluation by canceling lower-order discretization errors.

Principles

Bellman baseline is first-order in grid width.
Multi-step moment matching yields higher-order generator surrogates.
Higher-order gains depend on decision-frequency regime.
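
A simplified worked expansion may help illustrate the first two principles. It assumes time-homogeneous dynamics with generator $\mathcal{L}$ and a smooth test function $f$, and writes $u(h) = \mathbb{E}[f(X_{t+h}) \mid X_t = x]$. These symbols and the specific two-step weights are illustrative assumptions rather than the paper's notation, and the paper's time-inhomogeneous setting involves additional terms not shown here.

```latex
\begin{align*}
u(h) &= f + h\,\mathcal{L}f + \tfrac{h^2}{2}\,\mathcal{L}^2 f + \tfrac{h^3}{6}\,\mathcal{L}^3 f + O(h^4), \\
\frac{u(h) - u(0)}{h} &= \mathcal{L}f + \tfrac{h}{2}\,\mathcal{L}^2 f + O(h^2), \\
\frac{-3\,u(0) + 4\,u(h) - u(2h)}{2h} &= \mathcal{L}f - \tfrac{h^2}{3}\,\mathcal{L}^3 f + O(h^3).
\end{align*}
```

The one-step difference in the second line carries an $O(h)$ error, matching the first-order behavior attributed to the Bellman baseline. The two-step weights $c = (-\tfrac{3}{2},\, 2,\, -\tfrac{1}{2})$ satisfy $\sum_j c_j = 0$, $\sum_j j\,c_j = 1$, and $\sum_j j^2 c_j = 0$, which is what removes the $O(h)$ term.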

Method

Estimate the time-dependent generator from multi-step transitions using moment-matching coefficients, then combine the estimate with backward regression to solve the parabolic value equation. The order-$i$ estimator (Gen$i$) achieves $O(\Delta t^i)$ accuracy, e.g., $O(\Delta t^2)$ for Gen2 and $O(\Delta t^3)$ for Gen3.
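
The moment-matching step can be sketched in a few lines, under strong simplifying assumptions. The sketch below is not the paper's implementation: it assumes time-homogeneous Ornstein-Uhlenbeck dynamics, uses closed-form conditional expectations in place of regression on sampled trajectories, and omits the backward regression step entirely. All function names (`moment_matching_coeffs`, `ou_second_moment`, `generator_estimate`, `observed_order`) and the parameters `a` and `s` are hypothetical.

```python
import numpy as np


def moment_matching_coeffs(order):
    """Weights c_0..c_order with sum_j c_j * j**k = 1 if k == 1 else 0, k = 0..order."""
    j = np.arange(order + 1, dtype=float)
    # Row k of V holds j**k, so V @ c stacks the moment conditions.
    V = np.vander(j, N=order + 1, increasing=True).T
    rhs = np.zeros(order + 1)
    rhs[1] = 1.0
    return np.linalg.solve(V, rhs)


def ou_second_moment(x, h, a=1.0, s=0.5):
    """Closed-form E[X_h**2 | X_0 = x] for dX = -a X dt + s dW (assumed test model)."""
    decay = np.exp(-2.0 * a * h)
    return x**2 * decay + s**2 / (2.0 * a) * (1.0 - decay)


def exact_generator(x, a=1.0, s=0.5):
    """Generator applied to f(x) = x**2 for the same OU model."""
    return -2.0 * a * x**2 + s**2


def generator_estimate(x, h, order, a=1.0, s=0.5):
    """Multi-step surrogate: weighted sum of j-step expectations, divided by h."""
    c = moment_matching_coeffs(order)
    u = np.array([ou_second_moment(x, j * h, a, s) for j in range(order + 1)])
    return float(c @ u) / h


def observed_order(order, x=1.0, h=0.02):
    """Empirical convergence rate from halving the step size."""
    err_h = abs(generator_estimate(x, h, order) - exact_generator(x))
    err_half = abs(generator_estimate(x, h / 2, order) - exact_generator(x))
    return float(np.log2(err_h / err_half))


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(order, moment_matching_coeffs(order), observed_order(order))
```

With `order=1` the weights reduce to the one-step difference, the analogue of the first-order baseline in this toy setting. Orders 2 and 3 should show empirical rates near 2 and 3 as the step size is halved. With sampled data, the closed-form expectations would be replaced by regression estimates, which adds the finite-sample and projection errors that the paper's decomposition accounts for.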

In practice

Gen2 consistently outperforms Bellman baseline.
Richer feature classes are needed for high-order gains.
Gen2 is often the safest practical choice.

Topics

Continuous-Time Policy Evaluation
Generator Regression
Bellman Baseline
Multi-step Moment Matching
Error Decomposition

Code references

wagas165/beyond-bellman-ctpe

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.