How to build Multi Agents for FINANCE: Outperforming Anthropic

Source: Discover AI · Field: Finance & Economics (FinTech & Digital Financial Services, Banking & Financial Services, Capital Markets & Investment Management) · Depth: Expert, extended

Summary

IBM Research's evaluation of AI agents, dated February 26, 2026, points to a trade-off between cost and performance: the high-performing Claude Opus 4.5 costs $8-$45 and reaches 73% accuracy, while GPT 5.2 costs far less ($0.25-$0.50) but reaches only 39%. The article argues that existing financial AI agents, such as those from Anthropic, mainly automate predefined workflows and data gathering rather than exhibiting true intelligence. A new study, dated February 25, 2026, introduces Yuan 4.0, a 36-billion-parameter open-source model reported to outperform proprietary models such as Claude 4.5 by at least 9 percentage points on financial tasks. This outperformance is attributed to a novel training methodology that uses the "Financial Intelligence and Reasoning Evaluation" (FIRE) benchmark, which includes 14,000 theoretical questions and 3,000 real-world financial scenarios, together with a dual reward system for reinforcement learning.

Key takeaway

For AI scientists and NLP engineers developing financial applications, relying solely on large proprietary models may not yield the best performance or cost-efficiency. Consider fine-tuning smaller open-source models such as Yuan 4.0 with domain-specific data and advanced training methods, including the dual reward system and reverse chain-of-thought synthesis. This approach can achieve superior results on complex financial reasoning tasks and may allow you to run models locally behind your firewall.

Key insights

Specialized, locally runnable LLMs can outperform large proprietary models in domain-specific tasks through targeted training and novel evaluation.

Method

Yuan 4.0's training combines continual pre-training with self-regularization, DPO-based fine-tuning, and reinforcement learning with a dual reward system that scores both format and accuracy. Logical trajectories are generated from human expert rationales so that the model's reasoning simulates human reasoning.
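
A dual reward of this kind (one term for format, one for accuracy) can be illustrated with a minimal Python sketch. It is not the Yuan 4.0 implementation, which the summary does not describe in detail. The `<think>`/`<answer>` template, the exact-match comparison, and the weights `w_format` and `w_accuracy` are all hypothetical choices made for illustration.

```python
import re

# ASSUMPTION: completions put reasoning in <think>...</think> and the final
# answer in <answer>...</answer>. This template is hypothetical, not from the source.
TEMPLATE = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer equals the reference, else 0.0.

    Comparison ignores case and surrounding whitespace. Real financial answers
    would likely need numeric tolerance or a grader; that is out of scope here.
    """
    match = TEMPLATE.match(completion)
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == reference.strip().lower() else 0.0


def dual_reward(
    completion: str,
    reference: str,
    w_format: float = 0.2,    # ASSUMPTION: illustrative weight
    w_accuracy: float = 0.8,  # ASSUMPTION: illustrative weight
) -> float:
    """Weighted sum of the format and accuracy rewards."""
    return (
        w_format * format_reward(completion)
        + w_accuracy * accuracy_reward(completion, reference)
    )


if __name__ == "__main__":
    good = "<think>Annual coupon is 5% of 1000.</think><answer>50</answer>"
    wrong = "<think>Annual coupon is 5% of 1000.</think><answer>500</answer>"
    unformatted = "The coupon is 50."
    for text in (good, wrong, unformatted):
        print(dual_reward(text, "50"))
```

Keeping the two terms separate means a well-structured but incorrect answer still earns a small reward, while an unstructured answer earns none even when it contains the right number. Whether the original study weights or gates its rewards this way is not stated in the summary.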

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.