FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
Summary
FinTrace is a benchmark, with a companion training dataset, for evaluating and improving the tool-calling capabilities of large language models (LLMs) on complex, long-horizon financial tasks. It comprises 800 expert-annotated trajectories across 34 real-world financial task categories and moves beyond call-level metrics to assess reasoning over the whole trajectory. FinTrace scores each trajectory with a nine-metric rubric covering four dimensions: action correctness, execution efficiency, process quality, and output quality. An evaluation of 13 LLMs, including GPT-5.4 and Claude-Opus-4.6, showed that frontier models excel at tool selection but that all models struggle with information utilization and final answer quality. To address this gap, the authors created FinTrace-Training, the first trajectory-level preference dataset for financial tool calling, with 8,196 curated trajectories. Fine-tuning Qwen-3.5-9B on this dataset with supervised fine-tuning (SFT) followed by direct preference optimization (DPO) improved intermediate reasoning and suppressed failure modes, though end-to-end answer quality remains a bottleneck.
Key takeaway
For AI engineers developing financial LLM agents, this research shows that strong tool selection alone is insufficient; models must also reason well over tool outputs. Prioritize trajectory-level training with preference datasets such as FinTrace-Training to improve process quality and information utilization. SFT improves reasoning, while DPO is critical for suppressing inefficient or redundant tool calls. Future development should focus on closing the gap between improved intermediate reasoning and consistently accurate final answers.
Key insights
Trajectory-level evaluation and training are crucial for LLMs to master complex financial tool-calling beyond basic tool selection.
Principles
- Multi-dimensional rubrics reveal nuanced LLM agent performance.
- Trajectory-level training improves intermediate reasoning.
- DPO effectively reduces redundant tool calls.
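To make the first principle concrete, here is a minimal sketch of how per-metric scores might be grouped into the four rubric dimensions named in the summary. The individual metric names, the [0, 1] score scale, and the `TrajectoryScore` type are assumptions for illustration; the summary does not list the nine metrics or how they are scored.

```python
from dataclasses import dataclass
from statistics import mean

# The four rubric dimensions named in the summary.
DIMENSIONS = (
    "action_correctness",
    "execution_efficiency",
    "process_quality",
    "output_quality",
)


@dataclass
class TrajectoryScore:
    """Scores for one trajectory (hypothetical structure)."""
    trajectory_id: str
    # Maps dimension -> {metric name: score in [0, 1]}.
    # Metric names below are placeholders, not the benchmark's own.
    metrics: dict


def dimension_means(score: TrajectoryScore) -> dict:
    """Average the metric scores inside each dimension that has any."""
    return {
        dim: mean(score.metrics[dim].values())
        for dim in DIMENSIONS
        if score.metrics.get(dim)
    }


def weakest_dimension(score: TrajectoryScore) -> str:
    """Return the dimension with the lowest mean score."""
    means = dimension_means(score)
    return min(means, key=means.get)


example = TrajectoryScore(
    trajectory_id="demo-001",
    metrics={
        "action_correctness": {"tool_selection": 0.9, "argument_validity": 0.8},
        "execution_efficiency": {"redundancy": 0.7},
        "process_quality": {"information_utilization": 0.4},
        "output_quality": {"final_answer": 0.5},
    },
)

print(weakest_dimension(example))
```

A single call-level accuracy number would hide the pattern in this made-up example: tool selection scores well while the use of tool outputs does not.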
Method
FinTrace uses a multi-stage pipeline: query curation, LLM-generated candidate trajectories, expert validation, and iterative refinement to create golden-label trajectories for evaluation and training.
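The generate, validate, and refine stages could be wired together roughly as follows. This is a sketch under stated assumptions, not the authors' pipeline: the `generate`, `validate`, and `refine` callables, the feedback format, and the `max_rounds` cap are all hypothetical, and in the real process validation is done by human experts.

```python
from typing import Callable, List, Optional

Trajectory = List[str]  # assumed: a trajectory as an ordered list of tool-call names


def build_golden_trajectory(
    query: str,
    generate: Callable[[str], Trajectory],
    validate: Callable[[str, Trajectory], Optional[str]],
    refine: Callable[[str, Trajectory, str], Trajectory],
    max_rounds: int = 3,
) -> Optional[Trajectory]:
    """Return a validated trajectory, or None if refinement never converges."""
    candidate = generate(query)
    for round_index in range(max_rounds + 1):
        feedback = validate(query, candidate)  # None means "accepted"
        if feedback is None:
            return candidate
        if round_index < max_rounds:
            candidate = refine(query, candidate, feedback)
    return None


# Stub stages standing in for the LLM generator and the expert reviewer.
def stub_generate(query: str) -> Trajectory:
    return ["get_price"]


def stub_validate(query: str, trajectory: Trajectory) -> Optional[str]:
    if "compute_return" in trajectory:
        return None
    return "missing compute_return step"


def stub_refine(query: str, trajectory: Trajectory, feedback: str) -> Trajectory:
    return trajectory + ["compute_return"]


golden = build_golden_trajectory(
    "What was the one-year return?", stub_generate, stub_validate, stub_refine
)
print(golden)
```

Returning `None` when the round budget runs out keeps unvalidated candidates out of the golden set rather than silently accepting the last draft.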
In practice
- Use FinTrace to benchmark financial LLM agents.
- Apply SFT+DPO with trajectory-level data for agent improvement.
- Focus on information utilization for better final answer quality.
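For the SFT+DPO step, the sketch below computes the standard DPO objective for one preference pair, where each log-probability is assumed to be summed over the tokens of a whole trajectory. The value `beta=0.1` and the example numbers are illustrative assumptions; the summary does not give the training hyperparameters.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    loss = -log(sigmoid(beta * margin)), where the margin is how much more
    the policy prefers the chosen trajectory than the frozen reference does.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy already favors the chosen trajectory relative to the reference.
good = dpo_loss(-10.0, -12.0, -11.0, -11.0)
# Policy favors the rejected trajectory instead.
bad = dpo_loss(-12.0, -10.0, -11.0, -11.0)
print(good < bad)
```

Under this objective, a rejected trajectory padded with redundant tool calls is pushed down relative to a leaner chosen one, which matches the summary's point that DPO suppresses such calls. In a real training run the log-probabilities would come from the policy and reference models rather than hand-picked numbers.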
Topics
- LLM Tool Calling
- Financial LLM Agents
- Trajectory-Level Evaluation
- FinTrace Benchmark
- Direct Preference Optimization
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.