Path-Coupled Bellman Flows for Distributional Reinforcement Learning

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Path-Coupled Bellman Flows (PCBF) is a continuous-time distributional reinforcement learning (DRL) method that targets two limitations of existing flow-based approaches: boundary mismatch and high-variance bootstrapping. PCBF learns return distributions by flow matching along source-consistent, Bellman-coupled paths. Each current-state path starts from the required base prior at t=0, reaches the Bellman target at t=1, and stays in an affine relation to the successor flow. Coupling the current and successor return flows through shared base noise, together with a λ-parameterized control-variate target, trades a controlled amount of bias for lower variance. Experiments on analytically tractable Markov reward processes (MRPs), OGBench, and D4RL report improved distributional fidelity, more stable training, and competitive offline RL performance.

Key takeaway

For machine learning engineers building distributional reinforcement learning agents, PCBF offers a stable and accurate way to model return distributions. By addressing boundary mismatch and high-variance bootstrapping through source-consistent Bellman-coupled paths and shared-noise coupling, it improves distributional fidelity and training stability. Consider PCBF for applications that need precise uncertainty quantification or that involve heavy-tailed or multimodal returns, and tune the λ parameter carefully, since it sets the bias-variance trade-off.

Key insights

PCBF uses source-consistent, shared-noise Bellman-coupled paths and a λ-target to stabilize DRL flow matching.

Principles

Source-consistent Bellman paths fix t=0 boundary mismatch.
Shared-noise path coupling aligns current and successor flows.
λ-parameterized control variates balance bias and variance.
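
The first two principles can be made concrete with a minimal sketch. It assumes a straight-line interpolation path, which is a common choice in flow matching but is not confirmed by this summary as the exact path PCBF uses. The function names `coupled_paths` and `affine_from_successor`, and all numeric values, are hypothetical. The sketch shows that a path built from a shared noise draw starts at that draw at t=0, ends at the Bellman target r + γz' at t=1, and is an affine function of the successor path in between.

```python
import random

def coupled_paths(eps, reward, gamma, z_next, t):
    """Return (current, successor) path points at time t for one shared noise draw.

    Assumption: straight-line interpolation between source and endpoint.
    eps    : shared base-noise sample, the t=0 source of both paths
    z_next : successor return sample, taken here as the t=1 endpoint
             of the successor path for the same eps
    """
    successor = (1.0 - t) * eps + t * z_next                   # eps -> z_next
    current = (1.0 - t) * eps + t * (reward + gamma * z_next)  # eps -> r + gamma * z_next
    return current, successor

def affine_from_successor(successor, eps, reward, gamma, t):
    """Rebuild the current path point from the successor path point (affine map)."""
    return gamma * successor + t * reward + (1.0 - t) * (1.0 - gamma) * eps

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)               # one shared base-noise draw
reward, gamma, z_next = 1.0, 0.9, 2.5   # hypothetical values

x0, _ = coupled_paths(eps, reward, gamma, z_next, 0.0)  # equals eps
x1, _ = coupled_paths(eps, reward, gamma, z_next, 1.0)  # equals reward + gamma * z_next
```

Because both paths reuse the same `eps`, the current path never needs a second, independent noise sample. Under this assumed path, that reuse is what makes the affine relation hold exactly at every t.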

Method

PCBF learns a continuous neural velocity field in three steps: it repairs the flow path so that it starts from the base prior and ends at the Bellman target, couples the current and successor flows through shared noise, and uses a λ-parameterized control-variate target.
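
The role of λ can be illustrated with a generic control-variate estimator. This is not the paper's target, whose exact form this summary does not give. The function `lambda_target` and the toy noise model are hypothetical. The sketch subtracts λ times a zero-mean term that shares noise with the quantity being estimated, and measures how the variance falls as λ grows. In this toy the control variate has a known mean of zero, so no bias appears. The summary describes PCBF's version as accepting some controlled bias, which the sketch does not model.

```python
import random
import statistics

def lambda_target(x, c, lam):
    """Generic control-variate estimate: subtract lam times a zero-mean term c."""
    return x - lam * c

rng = random.Random(0)
samples = []
for _ in range(20000):
    shared = rng.gauss(0.0, 1.0)                   # noise common to both terms
    x = 1.0 + shared + 0.3 * rng.gauss(0.0, 1.0)   # noisy target with mean 1.0
    c = shared                                     # control variate with mean 0
    samples.append((x, c))

def variance_at(lam):
    """Empirical variance of the lambda-adjusted target over all samples."""
    return statistics.pvariance([lambda_target(x, c, lam) for x, c in samples])

# Variance shrinks as lam moves from 0 (no correction) toward 1 (full correction).
var_by_lambda = {lam: variance_at(lam) for lam in (0.0, 0.5, 1.0)}
mean_at_full = statistics.fmean(lambda_target(x, c, 1.0) for x, c in samples)
```

The variance reduction comes entirely from the correlation between `x` and `c`. If the two terms used independent noise, subtracting `c` would add variance instead of removing it.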

In practice

Accurately recovers ground-truth return distributions on toy MRPs.
Achieves competitive performance on OGBench and D4RL Adroit.
Reduces the Bellman residual under coarse discretization when shared noise is used.
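
An analytically tractable MRP is one whose true return distribution is known in closed form, so a learned distribution can be checked against it. The sketch below is a hypothetical example, not one of the paper's benchmarks. It assumes a single-state MRP with i.i.d. Gaussian rewards of mean `mu` and standard deviation `sigma`. The discounted return is then Gaussian with mean mu / (1 - γ) and variance sigma² / (1 - γ²), and a truncated Monte Carlo rollout should agree with those values.

```python
import random
import statistics

def sample_return(rng, mu, sigma, gamma, horizon=150):
    """One truncated discounted return from a single-state MRP with Gaussian rewards."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * rng.gauss(mu, sigma)
        discount *= gamma
    return g

mu, sigma, gamma = 1.0, 0.5, 0.9   # hypothetical MRP parameters
rng = random.Random(0)
returns = [sample_return(rng, mu, sigma, gamma) for _ in range(2000)]

# Closed-form moments of the infinite-horizon return distribution.
analytic_mean = mu / (1.0 - gamma)
analytic_var = sigma ** 2 / (1.0 - gamma ** 2)

# Monte Carlo estimates to compare against the closed form.
mc_mean = statistics.fmean(returns)
mc_var = statistics.pvariance(returns)
```

With γ = 0.9, the discount after 150 steps is far below the sampling noise, so the truncation error is negligible. A learned return model could be scored against the same closed form in place of `returns`.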

Topics

Distributional Reinforcement Learning
Flow Matching
Generative Models
Offline Reinforcement Learning
Bellman Operator
Variance Reduction

Code references

jax-ml/jax

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.