TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution

2026-06-07 · Source: Machine Learning · Field: Finance & Economics — Capital Markets & Investment Management, FinTech & Digital Financial Services · Depth: Expert, quick

Summary

TT-DAC-PS (Twin-Target Deterministic Actor-Critic with Policy Smoothing) is a novel deterministic actor-critic architecture for optimal execution of large stock sell programs. To mitigate Q-value overestimation, it combines twin exponential-moving-average critic targets, a pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation. Exploration uses Ornstein-Uhlenbeck (OU) noise on a hybrid schedule that combines deterministic episode-wise decay, a variance-guided adjustment based on recent reward dispersion, and a learned Soft Actor-Critic (SAC)-style temperature mapped to the noise scale. The environment models Almgren-Chriss (AC) trade impact using Limit Order Book (LOB) prices and volumes, with normalized state features, per-step volume participation caps, and a utility-based reward. On LOB data for ten U.S. stocks, TT-DAC-PS is reported to consistently reduce mean implementation shortfall percentage with competitive variance, outperforming reinforcement learning baselines such as PPO, SAC, and A2C as well as classical execution algorithms such as TWAP, VWAP, and AC.
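
To make the headline metric concrete, here is a minimal Python sketch of implementation shortfall percentage for a sell program, with fills priced under a linear Almgren-Chriss-style temporary impact. The function names, the parameters `eta`, `epsilon`, and `tau`, and all numbers are illustrative assumptions rather than values from the source; permanent impact and the LOB-based price and volume inputs are left out.

```python
import numpy as np

def ac_sell_price(mid_price, shares, tau=1.0, epsilon=0.0, eta=1e-3):
    """Price received when selling `shares` over one interval of length `tau`.

    Assumes the linear Almgren-Chriss temporary impact h(v) = epsilon + eta * v,
    with trading rate v = shares / tau. Permanent impact is ignored (assumption).
    """
    rate = shares / tau
    return mid_price - (epsilon + eta * rate)

def implementation_shortfall_pct(arrival_price, exec_prices, exec_shares):
    """Shortfall of realised sell proceeds versus the arrival-price value, in percent."""
    exec_prices = np.asarray(exec_prices, dtype=float)
    exec_shares = np.asarray(exec_shares, dtype=float)
    total_shares = exec_shares.sum()
    if total_shares <= 0:
        raise ValueError("no shares executed")
    benchmark_value = arrival_price * total_shares
    realised_value = float(np.dot(exec_prices, exec_shares))
    return 100.0 * (benchmark_value - realised_value) / benchmark_value

# Hypothetical example: sell 1,000 shares in two equal slices at a flat mid of 100.
slices = np.array([500.0, 500.0])
fills = np.array([ac_sell_price(100.0, n) for n in slices])
is_pct = implementation_shortfall_pct(100.0, fills, slices)
print(round(is_pct, 4))
```

With these made-up parameters each slice fills 0.5 below the mid, so the shortfall works out to 0.5% of the arrival value. A lower value means cheaper execution, which is the direction the reported results move in.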

Key takeaway

For quantitative traders or machine learning engineers optimizing large stock sell programs, TT-DAC-PS is worth evaluating as an execution policy. On LOB data for ten U.S. stocks, it is reported to consistently reduce mean implementation shortfall percentage while keeping variance competitive, ahead of both reinforcement learning baselines (PPO, SAC, A2C) and classical execution algorithms (TWAP, VWAP, AC). A lower shortfall translates directly into less costly execution of substantial equity trades.

Key insights

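TT-DAC-PS is a novel deterministic actor-critic RL algorithm for optimal trade execution. It reduces mean implementation shortfall by pairing overestimation controls (twin critic targets with a min backup, conservative Q regularisation) with a hybrid OU exploration schedule.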

Principles

Combining twin critic targets with pessimistic min backup curbs overestimation.
A hybrid OU noise schedule (episode-wise decay, reward-dispersion adjustment, learned temperature) adapts exploration over the course of training.
Conservative Q regularisation improves stability in actor-critic models.
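
The first principle can be sketched as a TD3-style target computation: perturb the target actor's action with clipped noise, evaluate both target critics, and back up the smaller value. This is a minimal NumPy illustration, not the authors' implementation. The callables `actor_target`, `q1_target`, and `q2_target`, the hyperparameter defaults, and the action range [0, 1] (read here as the fraction of inventory to sell) are all assumptions. The conservative Q regulariser is omitted because the summary does not give its form.

```python
import numpy as np

def smoothed_min_target(reward, done, next_state, actor_target, q1_target, q2_target,
                        gamma=0.99, noise_std=0.2, noise_clip=0.5,
                        action_low=0.0, action_high=1.0, rng=None):
    """TD target y = r + gamma * (1 - done) * min(Q1', Q2')(s', a'),
    where a' is the target action plus clipped Gaussian smoothing noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    next_action = actor_target(next_state)
    # Target policy smoothing: clipped noise, then clip back into the action range.
    noise = rng.normal(0.0, noise_std, size=next_action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    next_action = np.clip(next_action + noise, action_low, action_high)
    # Pessimistic backup: take the element-wise minimum of the twin target critics.
    q_min = np.minimum(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))
    return reward + gamma * (1.0 - done) * q_min

def ema_update(target_params, online_params, tau=0.005):
    """Exponential-moving-average update of target parameters toward online ones."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Toy stand-ins for the target networks (hypothetical, for illustration only).
actor_target = lambda s: np.full((s.shape[0], 1), 0.5)
q1_target = lambda s, a: 2.0 * a[:, 0]
q2_target = lambda s, a: a[:, 0] + 1.0

states = np.zeros((2, 3))
y = smoothed_min_target(np.array([1.0, 1.0]), np.array([0.0, 1.0]), states,
                        actor_target, q1_target, q2_target,
                        gamma=0.9, noise_std=0.0)  # noise disabled for a deterministic demo
```

Taking the minimum means a single over-optimistic critic cannot inflate the target on its own, which is how the twin-target design curbs overestimation.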

Method

TT-DAC-PS employs a deterministic actor-critic architecture with twin exponential-moving-average critic targets, TD3-style policy smoothing, delayed actor updates, and conservative Q regularisation. Exploration uses hybrid Ornstein-Uhlenbeck noise.
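
One way to picture the hybrid exploration schedule is an OU process whose scale is the product of three factors: an episode-wise decay, a reward-dispersion term, and a temperature. The sketch below is an assumption-heavy illustration. The summary does not state how the three components are combined, in which direction reward dispersion moves the scale, or how the temperature is mapped to it, so the multiplicative form, the `tanh` squashing, the clipping bounds, and every default value are hypothetical. The temperature `log_alpha` is passed in as a plain number; learning it is not shown.

```python
import numpy as np

class OUNoise:
    """Discretised OU process: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, dim, theta=0.15, mu=0.0, dt=1.0, seed=0):
        self.dim, self.theta, self.mu, self.dt = dim, theta, mu, dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Restart the process at its long-run mean, e.g. at the start of an episode.
        self.x = np.full(self.dim, self.mu, dtype=float)

    def sample(self, sigma):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.dim)
        self.x = self.x + drift + diffusion
        return self.x.copy()

def hybrid_sigma(episode, recent_rewards, log_alpha,
                 sigma0=0.3, decay=0.995, var_gain=0.5,
                 sigma_min=0.02, sigma_max=0.5):
    """Hypothetical noise scale combining decay, reward dispersion, and temperature."""
    base = sigma0 * decay ** episode                      # deterministic episode-wise decay
    dispersion = float(np.std(recent_rewards)) if len(recent_rewards) > 1 else 0.0
    var_factor = 1.0 + var_gain * np.tanh(dispersion)     # assumed: more dispersion, more noise
    temp_factor = float(np.exp(log_alpha))                # assumed: alpha scales sigma directly
    return float(np.clip(base * var_factor * temp_factor, sigma_min, sigma_max))

noise = OUNoise(dim=1)
sigma_early = hybrid_sigma(episode=0, recent_rewards=[], log_alpha=0.0)
sigma_late = hybrid_sigma(episode=200, recent_rewards=[], log_alpha=0.0)
perturbation = noise.sample(sigma_early)  # added to the deterministic action before clipping
```

Because OU noise is mean-reverting and temporally correlated, successive perturbations drift rather than jump, which suits an action like a per-step sell fraction.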

In practice

Apply TT-DAC-PS to large stock sell programs.
Use LOB data for training and evaluation.
Benchmark against RL baselines (PPO, SAC, A2C) and classical execution algorithms (TWAP, VWAP, AC).
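
As a starting point for the classical side of such a benchmark, a TWAP schedule under a per-step volume participation cap can be sketched as follows. The function name `twap_schedule`, the 10% cap, and the volumes are illustrative assumptions; the source does not give the cap value or describe how its baselines handle capped steps. Here any shortfall at a capped step is re-spread evenly over the remaining steps.

```python
import numpy as np

def twap_schedule(total_shares, market_volumes, max_participation=0.1):
    """Even-slice sell schedule where each slice is capped at a fraction of
    that step's market volume. Returns (schedule, unsold_shares)."""
    market_volumes = np.asarray(market_volumes, dtype=float)
    n_steps = len(market_volumes)
    caps = max_participation * market_volumes
    schedule = np.zeros(n_steps)
    remaining = float(total_shares)
    for t in range(n_steps):
        # Re-spread whatever is left evenly over the steps still to come.
        target = remaining / (n_steps - t)
        schedule[t] = min(target, caps[t])
        remaining -= schedule[t]
    return schedule, remaining

# Hypothetical example: 1,000 shares over four steps, with a thin second step.
schedule, unsold = twap_schedule(1000, [5000, 1000, 5000, 5000])
print(schedule, unsold)
```

In this example the second step is capped at 100 shares, so the later slices grow to absorb the difference. If every step were capped below its target, `unsold` would be positive, a case any fair comparison needs to price consistently across methods.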

Topics

Optimal Trade Execution
Reinforcement Learning
Actor-Critic Methods
Policy Smoothing
Limit Order Book
Algorithmic Trading

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.