Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

· Source: stat.ML updates on arXiv.org · Field: Finance & Economics — Capital Markets & Investment Management, FinTech & Digital Financial Services · Depth: Expert, quick

Summary

A new framework, FPILOT (Financial Plugin Inference-time Learning for Optimal Trading), augments reinforcement learning (RL) agents for portfolio management. The paper, submitted on May 12, 2026, targets the limitation of static RL policies, which stay fixed after training, by adding inference-time optimization driven by price forecasts. Inspired by Model Predictive Control (MPC), FPILOT uses a predictive model to generate multi-step price trajectories, under the assumption that the agent's own portfolio allocation has minimal impact on future prices. At each decision step, it constructs an allocation-based imagined return objective and optimizes the policy against it before executing a trade, without retraining the agent. Evaluated on the TradeMaster DJ30 benchmark across five policy learning algorithms, FPILOT consistently improved total return and risk-adjusted metrics such as the Sharpe, Sortino, and Calmar ratios, with stochastic policies benefiting more. Performance gains also rose with the quality of the synthetic forecasts used.
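For readers less familiar with the risk-adjusted metrics named in the summary, the following is a minimal sketch of how Sharpe, Sortino, and Calmar ratios are commonly computed from a series of per-period returns. The conventions here are assumptions, not taken from the paper or the benchmark: a zero risk-free rate, 252 trading periods per year, and the function names themselves are illustrative.

```python
import numpy as np


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized mean return over total volatility (risk-free rate assumed 0)."""
    returns = np.asarray(returns, dtype=float)
    return np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1)


def sortino_ratio(returns, periods_per_year=252):
    """Like Sharpe, but only negative returns count toward the risk term."""
    returns = np.asarray(returns, dtype=float)
    downside = np.minimum(returns, 0.0)
    downside_dev = np.sqrt(np.mean(downside ** 2))
    return np.sqrt(periods_per_year) * returns.mean() / downside_dev


def max_drawdown(returns):
    """Largest peak-to-trough loss of the wealth curve, as a positive fraction."""
    returns = np.asarray(returns, dtype=float)
    # Prepend the initial capital so a loss in the first period is counted.
    wealth = np.concatenate(([1.0], np.cumprod(1.0 + returns)))
    running_peak = np.maximum.accumulate(wealth)
    return float(np.max(1.0 - wealth / running_peak))


def calmar_ratio(returns, periods_per_year=252):
    """Annualized compound return divided by maximum drawdown."""
    returns = np.asarray(returns, dtype=float)
    total_growth = np.prod(1.0 + returns)
    annualized = total_growth ** (periods_per_year / len(returns)) - 1.0
    return annualized / max_drawdown(returns)
```

Sortino and Calmar both penalize only losses (downside deviation and drawdown respectively), which is why they are often reported alongside Sharpe when comparing trading policies.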

Key takeaway

For quantitative analysts and AI scientists building portfolio management systems, FPILOT offers a way to improve existing RL trading agents without costly retraining. Consider adding this inference-time optimization step so that pre-trained policies adapt to current price forecasts; in the reported experiments, doing so raised total return and risk-adjusted metrics such as the Sharpe and Sortino ratios. Forecast quality matters: the reported gains grew with it, so effort spent on better financial forecasting models should carry over to trading performance.

Key insights

FPILOT enhances RL trading agents by integrating inference-time optimization with price forecasts, improving returns and risk metrics.

Method

FPILOT uses a predictive model to generate multi-step price trajectories, then optimizes the pre-trained policy at inference time against an imagined return objective before executing a single trade step.
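The plan-then-trade loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the policy is a hypothetical linear-softmax map from a state vector to allocation weights, the imagined return is taken to be the return of holding one allocation over the whole forecast horizon, and the optimizer is plain gradient ascent. All names (`plan_then_act`, `imagined_return`, `steps`, `lr`) are invented for the example.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax, used to turn logits into portfolio weights."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def imagined_return(weights, forecast_prices):
    """Return of holding `weights` fixed along a forecast price path.

    forecast_prices has shape (horizon + 1, n_assets); row 0 is today's price.
    Prices are treated as unaffected by the agent's own allocation.
    """
    asset_returns = forecast_prices[-1] / forecast_prices[0] - 1.0
    return float(weights @ asset_returns)


def plan_then_act(W, b, state, forecast_prices, steps=20, lr=0.5):
    """Adapt a copy of the policy to the forecast, then return the allocation.

    The pre-trained parameters W and b are never modified, so the next
    decision step starts again from the original policy.
    """
    W, b = W.copy(), b.copy()
    asset_returns = forecast_prices[-1] / forecast_prices[0] - 1.0
    for _ in range(steps):
        weights = softmax(W @ state + b)
        # Gradient of (weights @ asset_returns) with respect to the logits.
        grad_logits = weights * (asset_returns - weights @ asset_returns)
        # Chain rule through logits = W @ state + b, then one ascent step.
        W += lr * np.outer(grad_logits, state)
        b += lr * grad_logits
    return softmax(W @ state + b)
```

A small number of ascent steps keeps the adapted allocation close to what the pre-trained policy would have chosen, which is one simple way to limit how much an imperfect forecast can move the trade.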

Best for: AI Scientist, Data Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.