Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading

2026-05-05 · Source: cs.MA updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Moira is a language-driven hierarchical reinforcement learning framework designed for pair trading, addressing challenges like delayed feedback and ambiguous credit assignment. It formulates pair trading as a hierarchical problem where a high-level "Selector" LLM chooses asset pairs based on long-horizon semantic reasoning, and a low-level "Trader" LLM executes short-horizon actions under partial observability. Both policies are parameterized by large language models (LLMs) and optimized exclusively through prompt updates, rather than gradient-based fine-tuning. The framework uses trajectory-level textual feedback for intra-episode Trader adaptation and episodic textual feedback for Selector updates, ensuring credit assignment aligns with feedback availability. Experiments on real-world U.S. equity market data, using 10 liquid U.S. stocks and daily news, demonstrate Moira's superior performance across financial metrics (AR, SR, Sortino, CR, MDD, AV, CVaR) compared to statistical, traditional RL, and flat LLM baselines.

Key takeaway

For AI Scientists and Machine Learning Engineers developing financial trading systems, Moira demonstrates that language-driven hierarchical reinforcement learning can significantly improve performance. You should consider adopting prompt-based optimization for both high-level strategic decisions and low-level execution, as this approach effectively manages delayed feedback and ambiguous credit assignment, leading to more stable and profitable trading strategies compared to traditional methods.

Key insights

Language-driven hierarchical reinforcement learning with prompt optimization effectively solves complex financial decision-making problems.

Principles

Separate abstraction selection from execution.
Align credit assignment with feedback availability.
Use language as a shared semantic interface.

Method

Moira optimizes high-level pair selection and low-level trade execution LLM policies through textual prompt updates. A critic LLM provides natural-language feedback, which an updater LLM integrates into the policy prompts.

In practice

Implement prompt-conditioned LLMs for hierarchical policies.
Utilize textual feedback for policy adaptation.
Apply asymmetric adaptation schedules for different hierarchical levels.

Topics

Hierarchical Reinforcement Learning
Language Models in Finance
Pair Trading
Prompt Optimization
Textual Feedback

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.