Finding the Time to Think: Learning Planning Budgets in Real-Time RL

2026-06-24 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The paper introduces variable-delay real-time RL, a formalization of real-time reinforcement learning (RL) in which the environment keeps progressing while the agent deliberates. In standard RL the environment waits indefinitely for an action; here the agent must choose how long to deliberate, its "planning budget," at each decision point. Because the optimal planning budget is state-dependent and meta-planning over budgets is inefficient, the authors train a lightweight "gating policy" that selects a state-dependent planning budget for an underlying planner. Evaluated on real-time versions of Pac-Man, Tetris, Snake, Speed Hex, and Speed Go, the gating policy consistently outperforms both fixed-budget and heuristic baselines. The approach also transfers to a real-time setup in which the environment and the agent run on two different GPUs.

Key takeaway

Machine learning engineers designing real-time RL agents should consider a learned gating policy to manage planning budgets dynamically. By adapting deliberation time to the current state, the approach consistently outperformed fixed-budget and heuristic baselines on the evaluated games, and it transferred to a setup where the environment and the agent run on separate GPUs. The same idea may help in other time-sensitive applications such as robotics or autonomous systems, although the evaluation summarized here covers only games.

Key insights

The paper tackles real-time RL by learning state-dependent planning budgets with a lightweight gating policy, which outperforms fixed-budget and heuristic baselines.

Principles

Optimal planning budgets are state-dependent in real-time RL.
Meta-planning for deliberation time can paralyze agents.
Environment progression during deliberation is a key real-time constraint.
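
The principles above can be illustrated with a minimal Python sketch of a toy variable-delay setting. Everything in it is an assumption made for illustration and is not taken from the paper: the `ToyRealTimeEnv` class, the `slack` state (how many ticks remain before an event arrives), the diminishing-returns quality curve, and the candidate budgets. It only shows why no single fixed budget is best once the environment advances during deliberation.

```python
from dataclasses import dataclass


@dataclass
class ToyRealTimeEnv:
    """Hypothetical environment: an event arrives after `slack` ticks."""
    slack: int

    def act_after_planning(self, budget: int) -> float:
        # The world keeps moving for `budget` ticks while the agent plans.
        if budget > self.slack:
            return -1.0  # the agent acted too late
        # Assumed planner quality: diminishing returns in the budget.
        return 1.0 - 0.5 ** budget


def best_budget(slack: int, budgets=(0, 1, 2, 4, 8)) -> int:
    """Oracle that tries every candidate budget in a given state."""
    env = ToyRealTimeEnv(slack)
    return max(budgets, key=env.act_after_planning)


if __name__ == "__main__":
    for slack in (1, 5, 10):
        print(slack, best_budget(slack))
```

In this toy, states with little slack favor short budgets and states with more slack favor longer ones. The oracle is only available because the toy return is known in closed form; a real agent cannot try every budget in every state, which is the gap a learned gating policy is meant to fill.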

Method

A lightweight gating policy is trained atop a planner to dynamically select state-dependent planning budgets. This avoids explicit meta-planning for deliberation time in variable-delay real-time RL.
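
As a rough sketch of the idea, the Python below trains a tabular softmax gate over a few discrete budgets in a toy problem. It is not the authors' method: the summary does not describe the training algorithm, so the exact-gradient update, the `toy_return` function, the `slack` state, and the `BUDGETS` set are all hypothetical. For determinism the sketch uses the exact expected policy gradient, which requires knowing the return of every budget; a real agent would have to estimate this from sampled rollouts of its planner.

```python
import math

BUDGETS = (0, 1, 2, 4, 8)  # hypothetical discrete planning budgets, in ticks


def toy_return(slack: int, budget: int) -> float:
    """Assumed return: planning helps with diminishing returns, lateness is penalized."""
    if budget > slack:
        return -1.0
    return 1.0 - 0.5 ** budget


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_gate(slacks=(1, 5, 10), steps=500, lr=0.5):
    """Gradient ascent on expected return, one row of logits per state."""
    logits = {s: [0.0] * len(BUDGETS) for s in slacks}
    for _ in range(steps):
        for s in slacks:
            probs = softmax(logits[s])
            rewards = [toy_return(s, b) for b in BUDGETS]
            value = sum(p * r for p, r in zip(probs, rewards))
            # Exact gradient of the expected return w.r.t. each logit: p_i * (r_i - V).
            logits[s] = [
                z + lr * p * (r - value)
                for z, p, r in zip(logits[s], probs, rewards)
            ]
    return logits


def select_budget(logits, slack: int) -> int:
    """Greedy gate: return the budget with the highest logit for this state."""
    row = logits[slack]
    return BUDGETS[max(range(len(BUDGETS)), key=lambda i: row[i])]


if __name__ == "__main__":
    gate = train_gate()
    print({s: select_budget(gate, s) for s in (1, 5, 10)})
```

At decision time the gate is a single table lookup, so choosing the budget costs almost nothing compared with the planning it controls. That is the sense in which a lightweight gate avoids explicit meta-planning; in a larger state space the table would presumably be replaced by a small function approximator.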

In practice

Apply gating policies to optimize planning in real-time games.
Test variable-delay RL in multi-GPU agent-environment setups.
Improve agent performance over fixed-budget planning.

Topics

Real-time Reinforcement Learning
Planning Budgets
Gating Policy
Variable-Delay RL
Multi-GPU Systems
Game AI

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.