How do frontier models perform on real-world finance problems?

2026-02-19 · Source: Surge AI Blog · Field: Finance & Economics — Banking & Financial Services, Capital Markets & Investment Management, FinTech & Digital Financial Services · Depth: Intermediate, long

Summary

A recent evaluation tested GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Pro on more than 200 finance scenarios across seven subcategories, including Basel capital calculations, commodities trading, and PowerPoint/Excel manipulation. All three models produced sophisticated output, yet each showed systematic failure modes in calculation accuracy, regulatory compliance, and basic file handling. GPT-5 was the top performer, winning 47% of tasks against 26% for Sonnet and 24% for Gemini, and leading in six of the seven domains. Even so, a holistic expert review rated more than 70% of responses as mediocre to bad. GPT-5 itself showed persistent blind spots in reasoning and in alignment with real-world processes, which indicates a significant gap between these models and human financial professionals.

Key takeaway

CTOs and VPs of Engineering evaluating LLM integration into financial workflows should recognize that current models, including top performers like GPT-5, are not yet production-ready for critical tasks. Prioritize human oversight and robust validation for any LLM-generated financial output, especially where regulatory compliance or multi-step processes are involved, to avoid significant operational risks and ensure accuracy.
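
As one illustration of what "robust validation" could look like, the sketch below recomputes a figure independently and compares it with the number a model reported, escalating to a human reviewer on mismatch. It is not taken from the original evaluation. The CET1 ratio (CET1 capital divided by risk-weighted assets) is used only as a familiar Basel-style example, and the names `CapitalInputs`, `recompute_cet1_ratio`, `check_model_ratio`, and the tolerance value are all hypothetical choices made for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class CapitalInputs:
    """Hypothetical container for the inputs behind a reported ratio."""
    cet1_capital: float          # Common Equity Tier 1 capital, in currency units
    risk_weighted_assets: float  # total risk-weighted assets, same currency units


def recompute_cet1_ratio(inputs: CapitalInputs) -> float:
    """Recompute the ratio from source inputs, independently of the model."""
    if inputs.risk_weighted_assets <= 0:
        raise ValueError("risk_weighted_assets must be positive")
    return inputs.cet1_capital / inputs.risk_weighted_assets


def check_model_ratio(inputs: CapitalInputs, model_ratio: float,
                      rel_tol: float = 1e-3) -> tuple[bool, float]:
    """Return (matches, expected) for a model-reported ratio.

    rel_tol is an assumed tolerance; a real control would set it
    according to the institution's own reporting precision.
    """
    expected = recompute_cet1_ratio(inputs)
    return math.isclose(model_ratio, expected, rel_tol=rel_tol), expected


# Usage: a mismatch is routed to a human reviewer instead of being accepted.
inputs = CapitalInputs(cet1_capital=120.0, risk_weighted_assets=1000.0)
matches, expected = check_model_ratio(inputs, model_ratio=0.125)
if not matches:
    print(f"Escalate for review: model reported 0.125, recomputed {expected:.4f}")
```

The design choice here is that the check never trusts the model's arithmetic: the reference value comes from the source inputs, so a calculation error in the model's output surfaces as a mismatch rather than passing through unnoticed.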

Key insights

LLMs show promise in finance but consistently fail to respect real-world constraints, complete multi-step workflows, or follow professional conventions.

Principles

Theoretical correctness does not equate to practical competence.
Real-world financial tasks demand implicit professional conventions.
Multi-step workflow execution is a common LLM failure point.

Method

Expert finance professionals developed 200+ scenarios across seven subcategories to evaluate LLMs on tasks like Basel capital calculations, Excel forecasting, and PowerPoint generation, focusing on real-world constraints and compliance.

In practice

LLMs struggle with file handling and output fidelity.
Models often misalign frameworks or apply incorrect methodologies.
They lack the "gut sense" to flag assumptions that are absurd in context.

Topics

Large Language Models
Financial Services
Model Evaluation
AI Limitations
Workflow Automation

Best for: CTO, VP of Engineering/Data, Executive, AI Engineer, AI Product Manager, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.