4 LLMs Tested in Codex, Claude Code, Hermes & OpenClaw (FinAI)

2026-05-17 · Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, AI for Capital Markets · Depth: Expert, extended

Summary

A study by a consortium including Yale, Columbia, and Nvidia, using 32,000 Nvidia GPU hours, evaluated four frontier LLMs (Claude Sonnet 4.6, GPT-5.4, Qwen 3.5 400B, and Qwen 3.5 27B) across five agent frameworks (Claude Code, Codex, Hermes, OpenClaw, ReAct) on financial tasks. Published May 13, 2026, the research covered trading, hedging, market insights, and auditing, and scored agents on metrics such as cumulative return and Sharpe ratio. The key finding is that agent framework architecture significantly affects performance: Claude Code and OpenClaw often outperformed the other frameworks even with the same LLM backbone. A critical caveat is that the LLM agents were susceptible to temporal overfitting. They failed to generalize to new market conditions and underperformed a simple buy-and-hold strategy in live evaluations from April to May 2026.

Key takeaway

For AI engineers developing financial trading or analysis systems: current LLM agents, even those built on frontier models like GPT-5.4 and Claude Sonnet 4.6, exhibit severe temporal overfitting and fail to generalize to shifting market conditions. Your choice of agent framework (e.g., Claude Code vs. ReAct) is as critical as the LLM itself and strongly affects accuracy and stability. Prioritize solutions that address long-horizon execution and dynamic market shifts, rather than only scaling model parameters, to avoid catastrophic performance failures.

Key insights

LLM agent performance in finance heavily depends on agent framework and suffers from temporal overfitting.

Principles

Integrator matters as much as the Hamiltonian.
Generalization does not imply deterministic verification.
Scaling parameters is not sufficient for long-horizon tasks.

Method

The study benchmarked four LLMs across five agent frameworks using 32,000 GPU hours on forward financial tasks like trading and hedging, evaluating performance with financial metrics and live market data.
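The two metrics named in the summary, cumulative return and Sharpe ratio, can be sketched as follows. This is a minimal illustration rather than the study's evaluation code: the function names, the zero risk-free rate, and the 252-period annualization factor are assumptions made here for the example.

```python
import math
from statistics import mean, stdev


def cumulative_return(returns):
    """Compound a sequence of simple per-period returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(returns, risk_free_per_period=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period simple returns.

    Assumptions (not from the study): a constant per-period risk-free rate
    and 252 trading periods per year.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns to estimate volatility")
    excess = [r - risk_free_per_period for r in returns]
    volatility = stdev(excess)  # sample standard deviation
    if volatility == 0:
        return float("nan")  # ratio is undefined without any variation
    return mean(excess) / volatility * math.sqrt(periods_per_year)
```

For example, `cumulative_return([0.10, -0.10])` is about -0.01, not 0, because the returns compound.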

In practice

Pair Claude Sonnet 4.6 with Claude Code or OpenClaw for auditing tasks.
Avoid pairing the Codex framework with Qwen 3.5 400B for trading tasks.
Recognize that current LLM agents overfit to past financial data.
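One way to check an agent for the temporal overfitting described above is to compare it against buy-and-hold separately on the period it was tuned on and on a later, unseen period. The sketch below is a hypothetical check, not the study's protocol: `temporal_gap`, the split index, and the sample numbers are illustrative assumptions.

```python
def cumulative_return(returns):
    """Compound simple per-period returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def temporal_gap(strategy_returns, asset_returns, split):
    """Compare a strategy with buy-and-hold before and after a time split.

    `asset_returns` are the returns from simply holding the asset.
    `split` is the index of the first out-of-sample period (an assumption
    of this sketch; choose it to match when the agent stopped seeing data).
    """
    if len(strategy_returns) != len(asset_returns):
        raise ValueError("return series must have the same length")
    if not 0 < split < len(strategy_returns):
        raise ValueError("split must leave data on both sides")

    def excess(lo, hi):
        return (cumulative_return(strategy_returns[lo:hi])
                - cumulative_return(asset_returns[lo:hi]))

    in_sample = excess(0, split)
    out_of_sample = excess(split, len(strategy_returns))
    return {
        "in_sample_excess": in_sample,
        "out_of_sample_excess": out_of_sample,
        # Beating buy-and-hold in-sample but losing to it afterwards is the
        # pattern this sketch flags as possible temporal overfitting.
        "overfit_warning": in_sample > 0 and out_of_sample < 0,
    }


# Synthetic numbers for illustration only, not results from the study.
report = temporal_gap(
    strategy_returns=[0.02, 0.02, -0.01, -0.01],
    asset_returns=[0.01, 0.01, 0.01, 0.01],
    split=2,
)
print(report["overfit_warning"])
```

A single split is a coarse signal. Repeating the comparison over several rolling splits would give a more stable picture.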

Topics

LLM Benchmarking
Agent Frameworks
Financial Market Prediction
Temporal Overfitting
Claude Sonnet 4.6

Best for: AI Engineer, NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.