PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management
Summary
PortBench is a new benchmark for evaluating Large Language Models (LLMs) in multi-asset portfolio management. It addresses gaps in existing benchmarks by incorporating cross-asset correlation structures and a full five-stage decision pipeline. It spans six heterogeneous asset classes from January 2015 to December 2025 and includes a static QA dataset of 6,269 correlation-based questions and a dynamic allocation pipeline. PortBench introduces a dual-layer correlation score to quantify diversification and CEPS (Cross-stage Error Propagation Score) to quantify how reasoning errors propagate across stages. In an evaluation of ten frontier LLMs, 90% of model-profile combinations failed to outperform a basic equal-weight allocation, and models suffered catastrophic drawdowns under three historical stress regimes despite strong static QA performance. The source code is available on GitHub.
Key takeaway
AI scientists and machine learning engineers developing LLMs for financial applications should prioritize models that genuinely understand and use cross-asset correlation structures rather than relying solely on static knowledge. Models must perform robustly under diverse market stress conditions and adapt to investor-specific risk profiles. Execution accuracy and risk monitoring deserve particular attention: these are the critical weaknesses where errors cascade into significant drawdowns.
Key insights
LLMs struggle with real-world portfolio management, failing to leverage correlation and exhibiting fragile performance under stress.
Principles
- Cross-asset correlation is crucial for genuine diversification.
- Errors in early decision stages cascade into poor outcomes.
- Static QA performance does not predict dynamic portfolio success.
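As a minimal sketch of why correlation matters for diversification, the Python snippet below computes the volatility of an equal-weight portfolio of six assets under a low and a high pairwise correlation. The 20% per-asset volatility, the uniform correlation values (0.1 and 0.9), and the helper names are illustrative assumptions, not figures or code from PortBench.

```python
import math

def portfolio_volatility(weights, vols, corr):
    """Portfolio standard deviation: sqrt(w^T Sigma w),
    where Sigma[i][j] = corr[i][j] * vols[i] * vols[j]."""
    n = len(weights)
    variance = 0.0
    for i in range(n):
        for j in range(n):
            variance += weights[i] * weights[j] * corr[i][j] * vols[i] * vols[j]
    return math.sqrt(variance)

def uniform_corr(n, rho):
    """Correlation matrix with 1.0 on the diagonal and rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]

# Assumed inputs: six assets, equal weights, 20% volatility each.
n_assets = 6
weights = [1.0 / n_assets] * n_assets
vols = [0.20] * n_assets

calm = portfolio_volatility(weights, vols, uniform_corr(n_assets, 0.1))
stressed = portfolio_volatility(weights, vols, uniform_corr(n_assets, 0.9))

print(f"volatility at rho=0.1: {calm:.4f}")
print(f"volatility at rho=0.9: {stressed:.4f}")
```

With these assumed inputs, portfolio volatility is about 10% at a correlation of 0.1 and about 19% at 0.9, so most of the diversification benefit disappears when the assets move together.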
Method
PortBench evaluates models in two parts: a static QA dataset that tests correlation reasoning, and a dynamic five-stage pipeline (market interpretation, signal generation, weight optimization, execution, risk monitoring). The dynamic pipeline is scored with CEPS and the dual-layer correlation score.
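The idea of errors propagating across the five stages can be illustrated with a toy measure. The sketch below is not CEPS, whose definition is not given in this summary; it is a hypothetical stand-in that reports, for each adjacent pair of stages, the share of upstream failures that were followed by a downstream failure. The stage identifiers, the `episode` helper, and the pass/fail data are all assumptions made for illustration.

```python
# Stage names follow the five-stage pipeline described above.
STAGES = [
    "market_interpretation",
    "signal_generation",
    "weight_optimization",
    "execution",
    "risk_monitoring",
]

def episode(failed=()):
    """Hypothetical helper: one pipeline run, mapping stage -> True if judged correct."""
    return {stage: stage not in failed for stage in STAGES}

def propagation_rates(episodes):
    """For each adjacent stage pair, the fraction of upstream failures that were
    followed by a downstream failure. None when the upstream stage never failed."""
    rates = {}
    for upstream, downstream in zip(STAGES, STAGES[1:]):
        upstream_failures = [ep for ep in episodes if not ep[upstream]]
        if not upstream_failures:
            rates[(upstream, downstream)] = None
            continue
        carried = sum(1 for ep in upstream_failures if not ep[downstream])
        rates[(upstream, downstream)] = carried / len(upstream_failures)
    return rates

# Made-up runs: one clean, one where a signal error cascades, one where it is contained.
runs = [
    episode(),
    episode(failed=("signal_generation", "weight_optimization", "execution")),
    episode(failed=("signal_generation",)),
]

rates = propagation_rates(runs)
for pair, rate in rates.items():
    print(pair, rate)
```

A rate near 1.0 for a stage pair would suggest that mistakes at the upstream stage are rarely corrected downstream, which is the kind of cascade the benchmark is designed to expose.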
In practice
- Evaluate LLMs under diverse stress regimes.
- Assess LLM adaptation to investor risk profiles.
- Focus on tail-risk management, not just return generation.
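A tail-risk check of this kind can be sketched as a maximum-drawdown comparison between a model's portfolio and an equal-weight baseline over a stress window. The return series below are made-up illustrative numbers, and the function names are assumptions; nothing here is taken from the PortBench code or results.

```python
def equal_weight_returns(asset_returns):
    """Per-period return of a portfolio rebalanced to equal weights every period.
    asset_returns: list of periods, each a list of per-asset simple returns."""
    return [sum(period) / len(period) for period in asset_returns]

def max_drawdown(returns):
    """Largest peak-to-trough loss of the cumulative wealth curve, as a positive fraction."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, 1.0 - wealth / peak)
    return worst

# Hypothetical stress window: four periods, two assets.
asset_returns = [
    [0.01, 0.03],
    [-0.04, -0.02],
    [-0.06, 0.00],
    [0.02, 0.04],
]
# Hypothetical per-period returns of an LLM-managed portfolio over the same window.
model_returns = [0.02, -0.10, -0.15, 0.05]

baseline_mdd = max_drawdown(equal_weight_returns(asset_returns))
model_mdd = max_drawdown(model_returns)

print(f"equal-weight max drawdown: {baseline_mdd:.4f}")
print(f"model max drawdown:        {model_mdd:.4f}")
```

Running the same comparison separately for each stress regime and each investor risk profile shows whether a model's drawdowns stay within what the profile can tolerate, rather than only whether its average return looks competitive.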
Topics
- LLM Benchmarking
- Portfolio Management
- Financial LLMs
- Cross-Asset Correlation
- Market Stress Testing
- Risk Management
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.