PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

2024-06-03 · Source: cs.AI updates on arXiv.org · Field: Finance & Economics — Capital Markets & Investment Management, FinTech & Digital Financial Services · Depth: Expert, extended

Summary

PortBench is a new benchmark for evaluating Large Language Models (LLMs) in multi-asset portfolio management. It addresses gaps in existing benchmarks by incorporating cross-asset correlation structures and a full five-stage decision pipeline. It spans six heterogeneous asset classes over roughly a decade of data (January 2015 to December 2025) and comprises a static QA dataset of 6,269 correlation-based questions plus a dynamic allocation pipeline. PortBench introduces a dual-layer correlation score to quantify diversification and CEPS (Cross-stage Error Propagation Score) to quantify how reasoning errors propagate across stages. In an evaluation of ten frontier LLMs, 90% of model-profile combinations failed to outperform a basic equal-weight allocation, and models suffered catastrophic drawdowns under three historical stress regimes despite strong static QA performance. The source code is available on GitHub.

Key takeaway

AI scientists and machine learning engineers developing LLMs for financial applications should prioritize models that genuinely understand and use cross-asset correlation structures rather than relying solely on static knowledge. Such models must perform robustly under diverse market stress conditions and adapt to investor-specific risk profiles. Execution accuracy and risk monitoring deserve particular attention: these are the critical weaknesses where errors cascade into significant financial drawdowns.

Key insights

LLMs struggle with real-world portfolio management, failing to leverage correlation and exhibiting fragile performance under stress.

Principles

Cross-asset correlation is crucial for genuine diversification.
Errors in early decision stages cascade into poor outcomes.
Static QA performance does not predict dynamic portfolio success.
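
To make the first principle concrete, here is a minimal sketch of the standard portfolio-volatility formula, showing how an equal-weight portfolio's risk changes with the correlation between its assets. It is not taken from PortBench; the four-asset setup, the 20% per-asset volatility, and the helper names are assumptions chosen for illustration.

```python
import math


def portfolio_volatility(weights, vols, corr):
    """Portfolio volatility from weights, per-asset vols, and a correlation matrix."""
    n = len(weights)
    variance = 0.0
    for i in range(n):
        for j in range(n):
            # Covariance of assets i and j is vol_i * vol_j * corr_ij.
            variance += weights[i] * weights[j] * vols[i] * vols[j] * corr[i][j]
    return math.sqrt(variance)


def uniform_corr(n, rho):
    """Hypothetical helper: n x n matrix with 1 on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


# Illustrative inputs (assumed, not benchmark data).
weights = [0.25] * 4
vols = [0.20] * 4

for rho in (0.0, 0.5, 1.0):
    vol = portfolio_volatility(weights, vols, uniform_corr(4, rho))
    print(f"rho={rho:.1f}  portfolio vol={vol:.4f}")
```

With uncorrelated assets the portfolio volatility is half that of any single asset, while at a correlation of 1 holding four assets reduces risk no further than holding one. A model that ignores correlation structure cannot tell these cases apart.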

Method

PortBench evaluates models in two parts: a static QA dataset that tests correlation reasoning, and a dynamic five-stage pipeline (market interpretation, signal generation, weight optimization, execution, risk monitoring) scored with CEPS and the dual-layer correlation score.
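
This summary does not give the CEPS formula, so the sketch below does not implement it. As a stand-in, it illustrates one simple way to look at error propagation across the five named stages: for each adjacent pair of stages, compare how often the downstream stage fails after an upstream failure with how often it fails after an upstream success. The function name, the per-stage pass/fail grading, and the toy runs are all assumptions.

```python
STAGES = [
    "market_interpretation",
    "signal_generation",
    "weight_optimization",
    "execution",
    "risk_monitoring",
]


def propagation_rates(runs):
    """Compare downstream failure rates after upstream failure vs. success.

    `runs` is a list of dicts mapping stage name -> True if that stage's
    output was graded correct (the grader itself is assumed to exist).
    Returns {(upstream, downstream): (rate_after_fail, rate_after_ok)};
    a rate is None when no run falls into that condition.
    """
    rates = {}
    for up, down in zip(STAGES, STAGES[1:]):
        after_fail = [not r[down] for r in runs if not r[up]]
        after_ok = [not r[down] for r in runs if r[up]]
        rate_fail = sum(after_fail) / len(after_fail) if after_fail else None
        rate_ok = sum(after_ok) / len(after_ok) if after_ok else None
        rates[(up, down)] = (rate_fail, rate_ok)
    return rates


# Toy pass/fail grades for four runs (illustrative only, not benchmark results).
runs = [
    dict(zip(STAGES, [True, True, True, True, True])),
    dict(zip(STAGES, [False, False, False, True, False])),
    dict(zip(STAGES, [True, False, False, False, False])),
    dict(zip(STAGES, [True, True, True, False, True])),
]

for pair, (after_fail, after_ok) in propagation_rates(runs).items():
    print(pair, after_fail, after_ok)
```

A large gap between the two rates for a stage pair suggests that errors at the upstream stage tend to carry forward rather than being corrected, which is the kind of cascade the benchmark's cross-stage score is described as measuring.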

In practice

Evaluate LLMs under diverse stress regimes.
Assess LLM adaptation to investor risk profiles.
Focus on tail-risk management, not just return generation.
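
As a rough sketch of the tail-risk check these points imply, the code below computes maximum drawdown for a model's return series and for an equal-weight baseline over the same periods. The return figures are toy values and the function names are assumptions; PortBench's own metrics and stress windows may be defined differently.

```python
def equal_weight_returns(asset_returns):
    """Per-period returns of a portfolio rebalanced to equal weights every period.

    `asset_returns` is a list of periods, each a list of simple returns per asset.
    """
    return [sum(period) / len(period) for period in asset_returns]


def max_drawdown(returns):
    """Largest peak-to-trough decline of cumulative wealth, as a positive fraction."""
    wealth = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, 1.0 - wealth / peak)
    return worst


# Toy two-asset return history (illustrative only).
asset_returns = [
    [0.10, 0.00],
    [-0.20, -0.10],
    [-0.10, 0.00],
    [0.05, 0.05],
]
# Hypothetical model portfolio that concentrated entirely in the first asset.
model_returns = [period[0] for period in asset_returns]
baseline_returns = equal_weight_returns(asset_returns)

print(f"model max drawdown:    {max_drawdown(model_returns):.4f}")
print(f"baseline max drawdown: {max_drawdown(baseline_returns):.4f}")
```

To evaluate a specific stress regime, the same functions can be applied to the slice of periods that covers that regime, so a model is compared with the baseline on its worst stretch rather than on average return alone.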

Topics

LLM Benchmarking
Portfolio Management
Financial LLMs
Cross-Asset Correlation
Market Stress Testing
Risk Management

Code references

AgenticFinLab/portbench

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.