TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution
Summary
TT-DAC-PS is a novel deterministic actor-critic architecture for optimal execution of large stock sell programs. To mitigate Q-value overestimation, it combines twin exponential-moving-average critic targets, a pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation. Exploration uses Ornstein-Uhlenbeck (OU) noise with a hybrid schedule that combines deterministic episode-wise decay, a variance-guided adjustment based on recent reward dispersion, and a learned Soft Actor-Critic (SAC)-style temperature mapped to the noise scale. The environment models Almgren-Chriss (AC) trade impact using Limit Order Book (LOB) prices and volumes, with normalized state features, per-step volume participation caps, and a utility-based reward. On LOB data for ten U.S. stocks, TT-DAC-PS consistently reduces mean implementation shortfall percentage with competitive variance, outperforming the reinforcement learning baselines PPO, SAC, and A2C as well as the classical execution algorithms TWAP, VWAP, and AC.
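The headline metric can be made concrete with a small sketch. The summary does not give the article's exact formula, so the definition below is an assumption: the common convention for a sell program, where shortfall is the value at the arrival price minus realized proceeds, expressed as a percentage of the arrival value. The function name and the sample fills are hypothetical.

```python
def implementation_shortfall_pct(arrival_price, fills):
    """Implementation shortfall of a sell program, in percent.

    Assumed definition (not taken from the article):
        100 * (arrival_value - proceeds) / arrival_value
    `fills` is a list of (shares_sold, execution_price) pairs.
    """
    total_shares = sum(qty for qty, _ in fills)
    if total_shares <= 0:
        raise ValueError("no shares executed")
    arrival_value = arrival_price * total_shares
    proceeds = sum(qty * price for qty, price in fills)
    return 100.0 * (arrival_value - proceeds) / arrival_value


# Hypothetical fills: 1,000 shares sold in three slices below the arrival price.
fills = [(400, 99.9), (300, 99.8), (300, 99.6)]
shortfall = implementation_shortfall_pct(100.0, fills)
```

A positive value means the program realized less than the arrival-price benchmark, so lower is better.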
Key takeaway
For quantitative traders and machine learning engineers working on large stock sell programs, TT-DAC-PS is worth evaluating as an execution policy. On LOB data for ten U.S. stocks, it reduces mean implementation shortfall percentage while keeping variance competitive, outperforming both reinforcement learning baselines (PPO, SAC, A2C) and classical schedules (TWAP, VWAP, AC). Lower shortfall translates into less costly execution of substantial equity trades.
Key insights
TT-DAC-PS is a novel RL algorithm for optimal trade execution that reduces implementation shortfall by combining twin critic targets, target policy smoothing, delayed actor updates, and conservative Q regularisation.
Principles
- Combining twin critic targets with pessimistic min backup curbs overestimation.
- Hybrid OU noise schedules enhance exploration in dynamic environments.
- Conservative Q regularisation improves stability in actor-critic models.
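The first principle above can be sketched in a few lines. This is a minimal illustration in the style of TD3, not the article's implementation: the function names, the hyperparameter values (`gamma`, `tau`, `noise_std`, `noise_clip`), the scalar action range, and the toy critics are all assumptions chosen for readability.

```python
import random


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def ema_update(target_params, online_params, tau=0.005):
    """Exponential-moving-average update of target parameters (tau is assumed)."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def pessimistic_target(reward, next_state, done, target_actor, target_q1, target_q2,
                       gamma=0.99, noise_std=0.2, noise_clip=0.5,
                       action_low=0.0, action_high=1.0, rng=random):
    """Bootstrapped critic target using smoothed target action and min of twin critics."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, action_low, action_high)
    # Pessimistic backup: take the smaller of the two target critic estimates.
    q_min = min(target_q1(next_state, next_action), target_q2(next_state, next_action))
    return reward + gamma * (0.0 if done else 1.0) * q_min


# Toy stand-ins for the target networks (hypothetical).
actor = lambda s: 0.5
q1 = lambda s, a: 1.0 + a
q2 = lambda s, a: 2.0 - a

rng = random.Random(0)
y = pessimistic_target(1.0, None, False, actor, q1, q2, rng=rng)
y_terminal = pessimistic_target(1.0, None, True, actor, q1, q2, rng=rng)
new_target = ema_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

Taking the minimum means a single over-optimistic critic cannot inflate the target, which is the overestimation control the principle refers to.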
Method
TT-DAC-PS employs a deterministic actor-critic architecture with twin exponential-moving-average critic targets, TD3-style policy smoothing, delayed actor updates, and conservative Q regularisation. Exploration uses hybrid Ornstein-Uhlenbeck noise.
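The exploration component can be illustrated as follows. The summary names three ingredients of the hybrid schedule (episode-wise decay, a variance-guided adjustment from recent reward dispersion, and a learned temperature) but not how they are combined, so the multiplicative rule in `hybrid_sigma`, every constant, and all names below are assumptions made for this sketch only.

```python
import math
import random
import statistics


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process with a per-step noise scale."""

    def __init__(self, theta=0.15, mu=0.0, dt=1.0, seed=None):
        self.theta, self.mu, self.dt = theta, mu, dt
        self.x = mu
        self.rng = random.Random(seed)

    def sample(self, sigma):
        # Mean-reverting drift plus Gaussian diffusion scaled by sigma.
        dx = (self.theta * (self.mu - self.x) * self.dt
              + sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0))
        self.x += dx
        return self.x


def hybrid_sigma(episode, recent_rewards, alpha,
                 sigma0=0.3, decay=0.995, sigma_min=0.01, sigma_max=0.5):
    """Assumed combination rule: decayed base * dispersion factor * temperature."""
    base = sigma0 * decay ** episode                      # deterministic episode-wise decay
    dispersion = statistics.pstdev(recent_rewards) if len(recent_rewards) > 1 else 0.0
    variance_factor = 1.0 + dispersion / (1.0 + dispersion)  # in [1, 2): widen when rewards disperse
    sigma = base * variance_factor * alpha                # alpha: learned SAC-style temperature
    return max(sigma_min, min(sigma_max, sigma))


noise = OUNoise(seed=0)
sigma = hybrid_sigma(episode=10, recent_rewards=[-0.2, 0.1, 0.4], alpha=1.0)
perturbation = noise.sample(sigma)
```

In a training loop, `perturbation` would be added to the deterministic actor's output before clipping the action to its valid range.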
In practice
- Apply TT-DAC-PS to large stock sell programs.
- Use LOB data for training and evaluation.
- Benchmark against the RL baselines PPO, SAC, and A2C and the classical schedules TWAP, VWAP, and AC.
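As a starting point for the benchmarking step, here is a minimal sketch of a TWAP sell schedule that respects a per-step volume participation cap, as the environment described in the summary does. The catch-up rule (spreading the remainder evenly over the steps left), the 10% cap, and the sample volumes are assumptions for illustration, not details from the article.

```python
def twap_schedule(total_shares, step_volumes, max_participation=0.1):
    """Even-paced sell schedule, capped at a fraction of each step's market volume.

    Returns (schedule, unexecuted_shares). Shares blocked by the cap are
    redistributed evenly across the remaining steps (assumed catch-up rule).
    """
    remaining = float(total_shares)
    schedule = []
    n_steps = len(step_volumes)
    for i, volume in enumerate(step_volumes):
        target = remaining / (n_steps - i)        # even split of what is left
        cap = max_participation * volume          # per-step participation cap
        qty = min(target, cap, remaining)
        schedule.append(qty)
        remaining -= qty
    return schedule, remaining


# Hypothetical per-step market volumes; the second step is thin, so the cap binds.
schedule, leftover = twap_schedule(1000, [5000, 1000, 5000, 5000])
```

Feeding the resulting fills into a shortfall calculation gives a classical reference point against which a learned policy can be compared on the same LOB data.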
Topics
- Optimal Trade Execution
- Reinforcement Learning
- Actor-Critic Methods
- Policy Smoothing
- Limit Order Book
- Algorithmic Trading
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.