PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

Source: cs.AI updates on arXiv.org · Field: Finance & Economics (Capital Markets & Investment Management, FinTech & Digital Financial Services) · Depth: Expert, extended

Summary

PortBench is a benchmark for evaluating Large Language Models (LLMs) on multi-asset portfolio management. It addresses gaps in existing benchmarks by incorporating cross-asset correlation structure and a full five-stage decision pipeline. It spans six heterogeneous asset classes from January 2015 to December 2025 and comprises a static QA dataset of 6,269 correlation-based questions and a dynamic allocation pipeline. PortBench introduces two metrics: a dual-layer correlation score, which quantifies diversification, and CEPS (Cross-stage Error Propagation Score), which quantifies how reasoning errors propagate across pipeline stages. In an evaluation of ten frontier LLMs, 90% of model-profile combinations failed to outperform a basic equal-weight allocation, and models suffered catastrophic drawdowns under three historical stress regimes despite strong static QA performance. The source code is available on GitHub.
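To make the two reference points in the summary concrete, here is a minimal sketch of an equal-weight baseline and a maximum-drawdown calculation. It is not taken from the PortBench code; the function names, the per-period rebalancing assumption, and the use of simple returns are illustrative assumptions.

```python
from typing import Sequence


def equal_weight_returns(asset_returns: Sequence[Sequence[float]]) -> list[float]:
    """Per-period return of a portfolio reset to equal weights every period.

    asset_returns[t][i] is the simple return of asset i in period t.
    Assumption: rebalancing happens each period and costs nothing.
    """
    return [sum(period) / len(period) for period in asset_returns]


def max_drawdown(period_returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the cumulative value curve.

    Returned as a positive fraction (0.25 means a 25% drop from the peak).
    """
    value = 1.0  # portfolio value, starting from 1
    peak = 1.0   # highest value seen so far
    worst = 0.0  # deepest drawdown seen so far
    for r in period_returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst
```

For example, `max_drawdown([0.1, -0.5, 0.2])` is approximately 0.5: the value rises to 1.1, falls to 0.55, and recovers only to 0.66. Comparing a model's allocation against `equal_weight_returns` on the same return matrix gives the kind of baseline check the summary refers to.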

Key takeaway

AI scientists and machine learning engineers developing LLMs for financial applications should prioritize models that genuinely understand and use cross-asset correlation structure rather than relying solely on static knowledge. Such models must perform robustly under diverse market stress conditions and adapt to investor-specific risk profiles. Execution accuracy and risk monitoring deserve particular attention, as these are the critical weaknesses where errors cascade into significant drawdowns.

Key insights

LLMs struggle with real-world portfolio management: they fail to exploit cross-asset correlation, and their performance is fragile under market stress.

Method

PortBench combines two evaluation components: a static QA dataset that tests correlation reasoning, and a dynamic five-stage pipeline (market interpretation, signal generation, weight optimization, execution, risk monitoring). The pipeline is scored with CEPS and the dual-layer correlation score.
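This summary does not give the formulas for CEPS or the dual-layer correlation score, so neither is reproduced here. As a rough illustration of what a correlation-aware diversification check can look like, the sketch below computes a generic weight-weighted average pairwise correlation for a portfolio. It is a standard diagnostic swapped in for illustration, not PortBench's metric, and all names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length return series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Assumption: neither series is constant, so the denominator is non-zero.
    return cov / sqrt(var_x * var_y)


def weighted_avg_correlation(
    weights: Sequence[float], series: Sequence[Sequence[float]]
) -> float:
    """Average correlation over distinct asset pairs, weighted by w_i * w_j.

    Lower values suggest a more diversified allocation. This is a generic
    illustrative diagnostic, not the score defined by PortBench.
    Assumption: at least two assets carry non-zero weight.
    """
    numerator = 0.0
    denominator = 0.0
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            pair_weight = weights[i] * weights[j]
            numerator += pair_weight * pearson(series[i], series[j])
            denominator += pair_weight
    return numerator / denominator
```

With three equally weighted assets where two move together and the third moves exactly opposite to both, the pairwise correlations are 1, -1, and -1, so the result is approximately -1/3. Shifting all weight onto the two co-moving assets would push the value toward 1, which is the kind of concentration a correlation-aware score is meant to penalize.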

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.