Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading
Summary
Moira is a language-driven hierarchical reinforcement learning framework designed for pair trading, addressing challenges like delayed feedback and ambiguous credit assignment. It formulates pair trading as a hierarchical problem where a high-level "Selector" LLM chooses asset pairs based on long-horizon semantic reasoning, and a low-level "Trader" LLM executes short-horizon actions under partial observability. Both policies are parameterized by large language models (LLMs) and optimized exclusively through prompt updates, rather than gradient-based fine-tuning. The framework uses trajectory-level textual feedback for intra-episode Trader adaptation and episodic textual feedback for Selector updates, ensuring credit assignment aligns with feedback availability. Experiments on real-world U.S. equity market data, using 10 liquid U.S. stocks and daily news, demonstrate Moira's superior performance across financial metrics (AR, SR, Sortino, CR, MDD, AV, CVaR) compared to statistical, traditional RL, and flat LLM baselines.
Key takeaway
For AI Scientists and Machine Learning Engineers developing financial trading systems, Moira demonstrates that language-driven hierarchical reinforcement learning can significantly improve performance. You should consider adopting prompt-based optimization for both high-level strategic decisions and low-level execution, as this approach effectively manages delayed feedback and ambiguous credit assignment, leading to more stable and profitable trading strategies compared to traditional methods.
Key insights
Language-driven hierarchical reinforcement learning with prompt optimization effectively solves complex financial decision-making problems.
Principles
- Separate abstraction selection from execution.
- Align credit assignment with feedback availability.
- Use language as a shared semantic interface.
Method
Moira optimizes high-level pair selection and low-level trade execution LLM policies through textual prompt updates. A critic LLM provides natural-language feedback, which an updater LLM integrates into the policy prompts.
In practice
- Implement prompt-conditioned LLMs for hierarchical policies.
- Utilize textual feedback for policy adaptation.
- Apply asymmetric adaptation schedules for different hierarchical levels.
Topics
- Hierarchical Reinforcement Learning
- Language Models in Finance
- Pair Trading
- Prompt Optimization
- Textual Feedback
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.