FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

FinTrace is a benchmark and training dataset for evaluating and improving the tool-calling capabilities of large language models (LLMs) on complex, long-horizon financial tasks. It comprises 800 expert-annotated trajectories across 34 real-world financial task categories and goes beyond call-level metrics to assess reasoning over the whole trajectory. Its nine-metric rubric covers four dimensions: action correctness, execution efficiency, process quality, and output quality. An evaluation of 13 LLMs, including GPT-5.4 and Claude-Opus-4.6, found that frontier models excel at tool selection, but all models struggle with information utilization and final answer quality. To address this, the authors built FinTrace-Training, the first trajectory-level preference dataset for financial tool calling, with 8,196 curated trajectories. Fine-tuning Qwen-3.5-9B on this dataset with supervised fine-tuning (SFT) followed by direct preference optimization (DPO) improved intermediate reasoning and suppressed failure modes, though end-to-end answer quality remains a bottleneck.

Key takeaway

For AI engineers building financial LLM agents, this research shows that strong tool selection alone is insufficient: models must also reason well over tool outputs. Prioritize trajectory-level training on preference datasets such as FinTrace-Training to improve process quality and information utilization. SFT improves reasoning, while DPO is critical for suppressing inefficient or redundant tool calls. Focus future work on closing the gap between better intermediate reasoning and consistently accurate final answers.

Key insights

Trajectory-level evaluation and training are crucial for LLMs to master complex financial tool-calling beyond basic tool selection.

Principles

Multi-dimensional rubrics reveal nuanced LLM agent performance.
Trajectory-level training improves intermediate reasoning.
DPO effectively reduces redundant tool calls.
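
To make the DPO principle concrete, the sketch below computes the standard DPO loss for a single preference pair, treating each trajectory's summed token log-probability as its sequence log-probability. The function name, the beta default of 0.1, and the example numbers are illustrative assumptions, not values reported for FinTrace-Training.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    Each argument is the log-probability of a whole trajectory under the
    policy being trained or under the frozen reference model.
    """
    # How much more the policy likes each trajectory than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities: the policy has shifted toward the concise
# (chosen) trajectory and away from the one with redundant tool calls.
loss = dpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-20.0,
                ref_chosen_logp=-12.0, ref_rejected_logp=-15.0)
print(f"DPO loss: {loss:.4f}")
```

The loss falls below log 2 once the policy prefers the chosen trajectory more strongly than the reference model does, which is how a rejected trajectory containing redundant calls gets pushed down.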

Method

FinTrace uses a multi-stage pipeline: query curation, LLM-generated candidate trajectories, expert validation, and iterative refinement to create golden-label trajectories for evaluation and training.
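
The stages above can be sketched as a generate, validate, refine loop. This is only an outline under stated assumptions: the Trajectory fields, the callback signatures, and the max_rounds cutoff are hypothetical and are not taken from the FinTrace pipeline itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """Hypothetical container for one tool-calling trajectory."""
    query: str
    steps: List[str]
    final_answer: str
    revision: int = 0


def build_golden_trajectories(
    queries: List[str],
    generate: Callable[[str], Trajectory],            # LLM-generated candidate
    validate: Callable[[Trajectory], Optional[str]],  # expert review; None means accepted
    refine: Callable[[Trajectory, str], Trajectory],  # revise using reviewer feedback
    max_rounds: int = 3,
) -> List[Trajectory]:
    """Keep only the trajectories that pass validation within max_rounds reviews."""
    golden: List[Trajectory] = []
    for query in queries:
        candidate = generate(query)
        for attempt in range(max_rounds):
            feedback = validate(candidate)
            if feedback is None:
                golden.append(candidate)
                break
            # Skip the refinement after the last review, since it would never be checked.
            if attempt < max_rounds - 1:
                candidate = refine(candidate, feedback)
    return golden
```

Passing the expert review in as a callback keeps the loop independent of how validation is actually carried out, whether by human annotators or an automated check.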

In practice

Use FinTrace to benchmark financial LLM agents.
Apply SFT+DPO with trajectory-level data for agent improvement.
Focus on information utilization for better final answer quality.
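
When benchmarking an agent, one simple trajectory-level signal for execution efficiency is the share of tool calls that exactly repeat an earlier call. The sketch below is an assumption for illustration: the call format and the redundancy definition are hypothetical and are not one of the nine FinTrace metrics.

```python
import json
from typing import Any, Dict, List

ToolCall = Dict[str, Any]  # hypothetical shape: {"tool": str, "args": dict}


def redundant_call_count(calls: List[ToolCall]) -> int:
    """Count calls that repeat an earlier call with the same tool and arguments."""
    seen = set()
    redundant = 0
    for call in calls:
        # Serialize args with sorted keys so key order does not affect equality.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant


def redundancy_rate(calls: List[ToolCall]) -> float:
    """Fraction of calls in a trajectory that are exact repeats (0.0 if empty)."""
    return redundant_call_count(calls) / len(calls) if calls else 0.0


# Hypothetical trajectory: the second price lookup repeats the first.
trajectory = [
    {"tool": "get_price", "args": {"ticker": "AAPL", "date": "2025-01-02"}},
    {"tool": "get_filing", "args": {"ticker": "AAPL", "form": "10-K"}},
    {"tool": "get_price", "args": {"date": "2025-01-02", "ticker": "AAPL"}},
]
print(redundant_call_count(trajectory), redundancy_rate(trajectory))
```

A score like this can also help rank candidate trajectories when assembling chosen and rejected pairs for preference training, though exact-match repetition misses near-duplicate calls.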

Topics

LLM Tool Calling
Financial LLM Agents
Trajectory-Level Evaluation
FinTrace Benchmark
Direct Preference Optimization

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.