500 investment bankers review AI outputs and find none ready for client delivery

2026-04-26 · Source: The Decoder · Field: Finance & Economics — Banking & Financial Services, FinTech & Digital Financial Services, Artificial Intelligence & Machine Learning · Depth: Intermediate, medium

Summary

Handshake AI and McGill University have released BankerToolBench, an open-source benchmark evaluating AI agents on junior investment banking tasks. The benchmark, which involved 500 current and former bankers, tested nine top models including GPT-5.4 and Claude Opus 4.6 on deliverables like Excel financial models, PowerPoint decks, and PDF reports. Not a single AI output was deemed client-ready: 41% required major rework and 27% were unusable. GPT-5.4 led the models with a score of 58.1 out of 100, but still failed nearly half the criteria, and only 16% of its outputs were considered a useful starting point. Common failure modes included code/formula generation bugs (41%), broken business logic (27%), aborted data queries (18%), and fabricated numbers (13%). The benchmark also found that Excel models from Claude Opus 4.6 contained hardcoded values, which made scenario analysis impossible.

Key takeaway

VPs of Engineering or Data leading AI integration in finance should recognize that current large language models are not yet suitable for autonomous client-facing work. Deploy AI as an assistive tool for junior analysts, with significant human oversight and validation, especially for complex tasks involving financial modeling and data synthesis. Prioritize robust error detection and validation workflows to mitigate the risk of subtle inaccuracies and fabricated data.

Key insights

AI outputs can look polished, but current models lack the accuracy and reliability that client-facing financial deliverables require.

Principles

Technical correctness outweighs superficial polish in financial AI.
Domain knowledge is critical for AI performance in specialized fields.

Method

BankerToolBench comprises 100 real-world tasks, with 500 bankers contributing to task design and review, and uses an AI verifier (Gandalf) to grade Excel, PowerPoint, and Word outputs against 150 criteria.
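
As a rough illustration of criteria-based grading, the sketch below scores one deliverable as the percentage of pass/fail checks it satisfies. This is an assumption for illustration only: the summary does not describe how BankerToolBench or Gandalf aggregates criteria into a score, and the names CriterionResult, rubric_score, and failed_criteria are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One pass/fail rubric check on a deliverable (hypothetical structure)."""
    name: str
    passed: bool


def rubric_score(results: list[CriterionResult]) -> float:
    """Return the share of criteria passed, on a 0-100 scale."""
    if not results:
        raise ValueError("at least one criterion is required")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


def failed_criteria(results: list[CriterionResult]) -> list[str]:
    """List the names of the criteria a reviewer should look at first."""
    return [r.name for r in results if not r.passed]


# Made-up example checks, not taken from the benchmark.
example = [
    CriterionResult("balance sheet balances", True),
    CriterionResult("growth rate is an input cell", False),
    CriterionResult("sources are cited", True),
    CriterionResult("totals match the data pull", False),
]
```

Here rubric_score(example) evaluates to 50.0, and failed_criteria(example) names the two checks that need human attention.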

In practice

Use AI outputs as starting points, not final deliverables.
Focus AI training on domain-specific context and tool use.
Require live formulas rather than hardcoded values in financial models, so that scenario analysis remains possible.
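
One way to act on the hardcoded-values point is an automated check before a model reaches a reviewer. The sketch below is a minimal example under stated assumptions: cell contents are already available as raw strings (formulas beginning with "="), and the reviewer supplies the set of cells that are legitimately typed-in inputs. The function names and the sample model are hypothetical, and a real check would also need to read the workbook file itself.

```python
def is_number(text: str) -> bool:
    """Return True if the text parses as a numeric literal."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_hardcoded_cells(cells: dict[str, str], input_cells: set[str]) -> list[str]:
    """Flag cells holding a typed-in number outside the declared input cells."""
    flagged = []
    for address, content in cells.items():
        text = content.strip()
        # Declared inputs and formulas are fine; labels are ignored below.
        if address in input_cells or text.startswith("="):
            continue
        if is_number(text):
            flagged.append(address)
    return sorted(flagged)


# Made-up model: B4 should be a formula but was pasted in as a value.
model = {
    "A1": "Growth rate",
    "B1": "0.05",
    "B2": "1000",
    "B3": "=B2*(1+B1)",
    "B4": "1102.5",
}
suspect = find_hardcoded_cells(model, input_cells={"B1", "B2"})
```

With this sample, suspect is ["B4"]: changing the growth rate in B1 would update B3 but leave B4 stale, which is the kind of defect that breaks scenario analysis.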

Topics

BankerToolBench
Investment Banking AI
AI Performance Benchmarking
Financial Modeling
Generative AI Limitations

Code references

Handshake-AI-Research/bankertoolbench

Best for: CTO, VP of Engineering/Data, Executive, AI Scientist, Director of AI/ML, AI Product Manager

Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.