How do frontier models perform on real-world finance problems?
Summary
A recent evaluation assessed GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Pro on over 200 finance scenarios across seven subcategories, including Basel capital calculations, commodities trading, and PowerPoint/Excel manipulation. All three models produced sophisticated responses but showed systematic failure modes in calculation accuracy, regulatory compliance, and basic file handling. GPT-5 was the top performer, winning 47% of tasks compared to Sonnet's 26% and Gemini's 24%, and leading in six of the seven domains. However, a holistic expert review rated over 70% of responses as mediocre to bad. Even GPT-5 showed persistent blind spots in reasoning and in alignment with real-world processes, which indicates a significant gap relative to human financial professionals.
Key takeaway
CTOs and VPs of Engineering evaluating LLM integration into financial workflows should recognize that current models, even top performers like GPT-5, are not yet production-ready for critical tasks. Prioritize human oversight and robust validation of any LLM-generated financial output, especially where regulatory compliance or multi-step processes are involved, to limit operational risk and catch inaccurate results.
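As one illustration of what "robust validation" could look like, the sketch below recomputes a capital ratio from its inputs and flags any model-reported figure that disagrees for human review. It is a minimal example under stated assumptions, not a method from the original article: the names `RatioCheck` and `check_capital_ratio`, the tolerance, and all input figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RatioCheck:
    reported: float      # ratio as stated in the LLM output
    recomputed: float    # ratio recomputed independently from the inputs
    needs_review: bool   # True when a human should look at the result
    reason: str


def check_capital_ratio(reported_ratio: float,
                        capital: float,
                        risk_weighted_assets: float,
                        tolerance: float = 1e-4) -> RatioCheck:
    """Compare a model-reported capital ratio with capital / RWA.

    Hypothetical helper: the tolerance is an illustrative assumption,
    not a regulatory threshold.
    """
    if risk_weighted_assets <= 0:
        return RatioCheck(reported_ratio, float("nan"), True,
                          "non-positive risk-weighted assets")
    recomputed = capital / risk_weighted_assets
    if abs(recomputed - reported_ratio) > tolerance:
        return RatioCheck(reported_ratio, recomputed, True,
                          "reported ratio does not match recomputation")
    return RatioCheck(reported_ratio, recomputed, False,
                      "matches recomputation")


# Made-up figures: the model reports 12.5%, but 120 / 1000 is 12%.
result = check_capital_ratio(reported_ratio=0.125,
                             capital=120.0,
                             risk_weighted_assets=1000.0)
print(result.needs_review, result.reason)
```

A check like this only covers arithmetic consistency. Whether the model applied the right framework or methodology is a separate question that still calls for expert review.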
Key insights
LLMs show promise in finance but consistently fail to handle real-world constraints, multi-step workflows, and professional conventions.
Principles
- Theoretical correctness does not equate to practical competence.
- Real-world financial tasks demand adherence to implicit professional conventions.
- Multi-step workflow execution is a common LLM failure point.
Method
Expert finance professionals developed 200+ scenarios across seven subcategories to evaluate LLMs on tasks like Basel capital calculations, Excel forecasting, and PowerPoint generation, focusing on real-world constraints and compliance.
In practice
- LLMs struggle with file handling and output fidelity.
- Models often misalign frameworks or apply incorrect methodologies.
- They lack "gut sense" for contextually absurd assumptions.
Topics
- Large Language Models
- Financial Services
- Model Evaluation
- AI Limitations
- Workflow Automation
Best for: CTO, VP of Engineering/Data, Executive, AI Engineer, AI Product Manager, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.