500 investment bankers review AI outputs and find none ready for client delivery
Summary
Handshake AI and McGill University have released BankerToolBench, an open-source benchmark that evaluates AI agents on junior investment banking tasks. Built and reviewed with 500 current and former bankers, it tested nine top models, including GPT-5.4 and Claude Opus 4.6, on deliverables such as Excel financial models, PowerPoint decks, and PDF reports. Reviewers judged no output client-ready: 41% required major rework and 27% were unusable. GPT-5.4 led with a score of 58.1 out of 100 but still failed nearly half the criteria, and only 16% of its outputs were considered a useful starting point. Common failure modes included code and formula generation bugs (41%), broken business logic (27%), aborted data queries (18%), and fabricated numbers (13%). The benchmark also found that Claude Opus 4.6 hardcoded values in its Excel models, which made scenario analysis impossible.
Key takeaway
If you lead AI integration in finance as a VP of Engineering or Data, treat current large language models as unsuitable for autonomous client-facing work. Deploy AI as an assistive tool for junior analysts, with significant human oversight and validation, especially for complex financial modeling and data synthesis. Prioritize error detection and validation workflows that catch subtle inaccuracies and fabricated data.
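One way to picture a validation step for fabricated numbers is a check that every figure in an AI-written draft can be traced to a trusted source value. The sketch below is illustrative only: the function name `unsupported_numbers`, the regular expression, and the sample data are assumptions, not part of BankerToolBench. It ignores signs, units, and rounding conventions that a real workflow would need to handle.

```python
import re

# Matches unsigned numbers such as 2024, 310.0, or 1,250.5 (assumed format).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def unsupported_numbers(draft_text, source_values, tolerance=1e-9):
    """Return numbers found in the draft that match no trusted source value."""
    found = [float(token.replace(",", "")) for token in NUMBER.findall(draft_text)]
    return [
        value
        for value in found
        if not any(abs(value - source) <= tolerance for source in source_values)
    ]


# Hypothetical draft sentence and hypothetical source data.
draft = "Revenue was 1,250.5 million in 2024, with EBITDA of 310.0 million."
trusted = [1250.5, 2024, 300.0]
print(unsupported_numbers(draft, trusted))  # [310.0]
```

A flagged number is not proof of fabrication, only a prompt for a human reviewer to trace it back to the source data.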
Key insights
Although their outputs look polished, AI models currently lack the accuracy and reliability needed for client-facing financial deliverables.
Principles
- Technical correctness outweighs superficial polish in financial AI.
- Domain knowledge is critical for AI performance in specialized fields.
Method
BankerToolBench comprises 100 real-world tasks designed and reviewed by 500 bankers. An AI verifier (Gandalf) grades Excel, PowerPoint, and Word outputs against 150 criteria.
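To make criteria-based grading concrete, here is a minimal sketch that turns pass/fail verdicts into a score out of 100. It assumes every criterion counts equally. The source does not say how BankerToolBench weights or aggregates its criteria, so `CriterionResult`, `rubric_score`, and the sample criteria are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str     # hypothetical criterion label
    passed: bool  # verdict from a reviewer or an automated verifier


def rubric_score(results):
    """Return the share of passed criteria on a 0-100 scale (unweighted assumption)."""
    if not results:
        raise ValueError("no criteria to score")
    passed = sum(1 for result in results if result.passed)
    return round(100 * passed / len(results), 1)


# Hypothetical verdicts for a single deliverable.
example = [
    CriterionResult("balance sheet balances", True),
    CriterionResult("projections driven by formulas", False),
    CriterionResult("sources cited", True),
]
print(rubric_score(example))  # 66.7
```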
In practice
- Use AI outputs as starting points, not final deliverables.
- Focus AI training on domain-specific context and tool use.
- Prioritize live formulas over hardcoded values in financial models.
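The hardcoded-value problem can be screened for mechanically. The sketch below assumes a simplified, hypothetical representation of a worksheet as a mapping from cell address to raw content, where formulas start with "=" and a declared set of input cells is allowed to hold plain numbers. The function name `find_hardcoded_cells` and the sample sheet are illustrative, and a real check would read the workbook with a spreadsheet library.

```python
def find_hardcoded_cells(cells, input_cells):
    """Return addresses outside the declared inputs that hold a bare number."""
    flagged = []
    for address, content in cells.items():
        if address in input_cells:
            continue  # assumptions and inputs may legitimately be typed in
        text = str(content).strip()
        if text.startswith("="):
            continue  # a formula recalculates when its inputs change
        try:
            float(text)
        except ValueError:
            continue  # a text label, not a number
        flagged.append(address)
    return sorted(flagged)


# Hypothetical sheet: B3 is a live formula, B4 is a pasted result.
sheet = {
    "A1": "Revenue", "B1": "1000",
    "A2": "Growth", "B2": "0.05",
    "A3": "Next year", "B3": "=B1*(1+B2)",
    "A4": "Year after", "B4": "1102.5",
}
print(find_hardcoded_cells(sheet, input_cells={"B1", "B2"}))  # ['B4']
```

Changing the growth rate in B2 would update B3 but leave B4 stale, which is why a hardcoded result breaks scenario analysis.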
Topics
- BankerToolBench
- Investment Banking AI
- AI Performance Benchmarking
- Financial Modeling
- Generative AI Limitations
Best for: CTO, VP of Engineering/Data, Executive, AI Scientist, Director of AI/ML, AI Product Manager
Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.