4 LLMs Tested in Codex, Claude Code, Hermes & OpenClaw (FinAI)
Summary
A study by a consortium including Yale, Columbia, and Nvidia used 32,000 Nvidia GPU hours to evaluate four frontier LLMs (Claude Sonnet 4.6, GPT-5.4, Qwen 3.5 400B, and Qwen 3.5 27B) across five agent frameworks (Claude Code, Codex, Hermes, OpenClaw, and ReAct) on financial tasks. The research, published May 13, 2026, covered trading, hedging, market insights, and auditing, and scored agents on metrics such as cumulative return and Sharpe ratio. The key finding is that agent framework architecture significantly affects performance: Claude Code and OpenClaw often outperformed the other frameworks even with the same LLM backbone. The agents were also susceptible to temporal overfitting. They failed to generalize to new market conditions and underperformed simple buy-and-hold strategies in live evaluations from April to May 2026.
Key takeaway
For AI engineers building financial trading or analysis systems: current LLM agents, even with frontier models like GPT-5.4 and Claude Sonnet 4.6, exhibit severe temporal overfitting and fail to generalize to shifting market conditions. The choice of agent framework (e.g., Claude Code vs. ReAct) is as critical as the LLM itself and strongly affects accuracy and stability. To avoid catastrophic performance failures, prioritize solutions that address long-horizon execution and dynamic market shifts rather than only scaling model parameters.
Key insights
LLM agent performance in finance heavily depends on agent framework and suffers from temporal overfitting.
Principles
- The integrator (the agent framework) matters as much as the Hamiltonian (the LLM backbone).
- Generalization does not imply deterministic verification.
- Scaling parameters is not sufficient for long-horizon tasks.
Method
The study benchmarked four LLMs across five agent frameworks, using 32,000 GPU hours, on forward-looking financial tasks such as trading and hedging. Performance was evaluated with financial metrics and live market data.
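The two metrics named in the summary, cumulative return and Sharpe ratio, can be sketched as follows. This is a minimal illustration and not the study's evaluation code: the function names, the 252-period annualization, the zero risk-free rate, and the sample returns are all assumptions made for the example.

```python
import math
from statistics import mean, stdev


def cumulative_return(returns):
    """Compound per-period simple returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(returns, risk_free_per_period=0.0, periods_per_year=252):
    """Annualized Sharpe ratio using the sample standard deviation.

    Assumes `returns` holds at least two per-period simple returns.
    Returns NaN when the excess returns have zero volatility.
    """
    excess = [r - risk_free_per_period for r in returns]
    vol = stdev(excess)
    if vol == 0:
        return float("nan")
    return mean(excess) / vol * math.sqrt(periods_per_year)


if __name__ == "__main__":
    # Made-up daily returns, for illustration only.
    daily = [0.01, -0.005, 0.002, 0.007, -0.003]
    print("cumulative return:", cumulative_return(daily))
    print("annualized Sharpe:", sharpe_ratio(daily))
```

Cumulative return rewards the final outcome only, while the Sharpe ratio penalizes volatility along the way, so the two can rank the same agent differently.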
In practice
- Pair Claude Sonnet 4.6 with Claude Code or OpenClaw for auditing tasks.
- Avoid pairing the Codex framework with Qwen 3.5 400B for trading tasks.
- Recognize that current LLM agents overfit to past financial data.
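One way to act on the overfitting warning is to compare an agent against buy-and-hold separately on the window it was tuned on and on a later window. The sketch below is a hypothetical check, not the study's protocol: the helper names, the single split index, the flagging rule, and the sample data are assumptions.

```python
def compound(returns):
    """Compound per-period simple returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def excess_over_buy_and_hold(agent_returns, asset_returns):
    """Agent cumulative return minus buy-and-hold return over the same periods."""
    if len(agent_returns) != len(asset_returns):
        raise ValueError("return series must have the same length")
    return compound(agent_returns) - compound(asset_returns)


def temporal_overfit_report(agent_returns, asset_returns, split):
    """Compare excess return before and after index `split`.

    Periods [0, split) stand in for the backtest window and
    [split, end) for the later, unseen window.
    """
    in_sample = excess_over_buy_and_hold(
        agent_returns[:split], asset_returns[:split]
    )
    out_of_sample = excess_over_buy_and_hold(
        agent_returns[split:], asset_returns[split:]
    )
    return {
        "in_sample_excess": in_sample,
        "out_of_sample_excess": out_of_sample,
        # Assumed rule of thumb: beating the baseline early and
        # losing to it later is a warning sign worth investigating.
        "suspect_overfit": in_sample > 0 and out_of_sample < 0,
    }


if __name__ == "__main__":
    # Made-up per-period returns, for illustration only.
    agent = [0.02, 0.02, -0.01, -0.01]
    asset = [0.01, 0.01, 0.01, 0.01]
    print(temporal_overfit_report(agent, asset, split=2))
```

A single split is the simplest version of this check. Repeating it over several rolling windows gives a less noisy picture of whether the gap to buy-and-hold persists.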
Topics
- LLM Benchmarking
- Agent Frameworks
- Financial Market Prediction
- Temporal Overfitting
- Claude Sonnet 4.6
Best for: AI Engineer, NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.