FinBalance: A Multi-Document Accounting Reconciliation Benchmark

Source: Computation and Language · Field: Finance & Economics — FinTech & Digital Financial Services, Artificial Intelligence & Machine Learning, Accounting & Financial Reporting · Depth: Advanced, quick

Summary

FinBalance is a multi-document accounting reconciliation benchmark that evaluates large language models (LLMs) on accounting work that starts from source documents rather than prepared artifacts. The benchmark comprises source-document bundles spanning eight industries, three period types, and five difficulty levels. A deterministic generator composes human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates, and produces gold journal entries, balance sheets, and labels drawn from 23 inconsistency codes. On a 710-record evaluation split, the best of six contemporary LLMs reached only 46% exact final-balance-sheet accuracy. A gap of 26 to 41 percentage points separates the balance sheets the models report ("BS_exact") from the balance sheets obtained by replaying their own entries ("BS_recon"), which points to problems with document binding and consistent aggregation. Citation-pressure prompting had minimal effect on document-linking errors, whereas ledger-feedback ablations substantially improved reported balance sheets and exposed trade-offs in inconsistency detection. Expert finance reviewers validated the benchmark's design and labels.
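The distinction between a reported balance sheet and one replayed from entries can be illustrated with a small sketch. This is not the benchmark's scoring code: the chart of accounts, the entry format, and the helpers `replay_entries` and `bs_exact` are hypothetical, and only the metric names BS_exact and BS_recon come from the summary above.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical chart of accounts: account name -> balance-sheet side.
ACCOUNT_SIDE = {
    "Cash": "asset",
    "Inventory": "asset",
    "Accounts Payable": "liability",
    "Share Capital": "equity",
}


def replay_entries(entries):
    """Aggregate journal lines into per-account balance-sheet balances."""
    balances = defaultdict(Decimal)
    for entry in entries:
        for account, debit, credit in entry["lines"]:
            amount = Decimal(debit) - Decimal(credit)
            # Assets grow with debits; liabilities and equity grow with credits.
            if ACCOUNT_SIDE[account] != "asset":
                amount = -amount
            balances[account] += amount
    return dict(balances)


def bs_exact(sheet, gold):
    """Exact match: every account and every amount must agree with gold."""
    return sheet == gold


# Entries a model might emit, each bound to a (made-up) source document id.
entries = [
    {"doc_id": "BANK-001",
     "lines": [("Cash", "1000", "0"), ("Share Capital", "0", "1000")]},
    {"doc_id": "INV-001",
     "lines": [("Inventory", "400", "0"), ("Accounts Payable", "0", "400")]},
]

gold = {
    "Cash": Decimal("1000"),
    "Inventory": Decimal("400"),
    "Accounts Payable": Decimal("400"),
    "Share Capital": Decimal("1000"),
}

# A reported sheet that disagrees with the model's own entries on Cash.
reported = dict(gold, Cash=Decimal("1400"))
replayed = replay_entries(entries)

print("reported sheet matches gold:", bs_exact(reported, gold))
print("replayed sheet matches gold:", bs_exact(replayed, gold))
```

In this toy case the entries are right and the reported sheet is wrong; the reverse can happen just as easily, which is why scoring the two separately is informative.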

Key takeaway

For AI scientists and NLP engineers developing LLMs for financial accounting: even the best of the six evaluated models reaches only 46% exact final-balance-sheet accuracy on multi-document reconciliation. Prioritize research into robust document-to-entry binding and consistent aggregation logic, the two failure points the benchmark exposes. Consider integrating ledger-feedback loops into your training or inference pipelines: in the ablations they substantially improved reported balance sheets, but they also revealed trade-offs in inconsistency detection that you will need to weigh.
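One way to picture a ledger-feedback step is a checker that compares a model's entries with its reported balance sheet and returns messages the model can act on in a second pass. The summary does not describe how the paper's ablation is implemented, so the function `ledger_feedback`, the entry format, and the message wording below are all assumptions made for illustration.

```python
from decimal import Decimal


def ledger_feedback(entries, reported_sheet, replayed_sheet):
    """Return plain-text messages describing ledger problems (hypothetical format)."""
    messages = []
    # Check 1: every journal entry must balance (total debits == total credits).
    for entry in entries:
        debits = sum(Decimal(d) for _, d, _ in entry["lines"])
        credits = sum(Decimal(c) for _, _, c in entry["lines"])
        if debits != credits:
            messages.append(
                f"{entry['doc_id']}: debits {debits} != credits {credits}"
            )
    # Check 2: the reported sheet must agree with what the entries imply.
    for account in sorted(set(reported_sheet) | set(replayed_sheet)):
        rep = reported_sheet.get(account, Decimal("0"))
        rec = replayed_sheet.get(account, Decimal("0"))
        if rep != rec:
            messages.append(f"{account}: reported {rep} but entries imply {rec}")
    return messages


# Made-up example: an unbalanced entry and a sheet that ignores it.
entries = [
    {"doc_id": "INV-007",
     "lines": [("Inventory", "500", "0"), ("Accounts Payable", "0", "450")]},
]
reported_sheet = {"Inventory": Decimal("500"), "Accounts Payable": Decimal("500")}
replayed_sheet = {"Inventory": Decimal("500"), "Accounts Payable": Decimal("450")}

feedback = ledger_feedback(entries, reported_sheet, replayed_sheet)
for line in feedback:
    print(line)
```

The messages would be appended to the prompt for a revision turn. Feedback of this kind only checks internal consistency, so it cannot tell whether an inconsistency in the source documents was correctly flagged.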

Key insights

LLMs struggle with multi-document accounting reconciliation: the best model reaches only 46% exact final-balance-sheet accuracy, and the main failure points are binding documents to entries and aggregating those entries consistently.

Method

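FinBalance uses a deterministic generator to compose human-authored scenarios, policies, and templates into source-document bundles, and to produce the matching gold journal entries, balance sheets, and labels drawn from 23 inconsistency codes against which LLM outputs are evaluated.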
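The idea of deterministic composition can be sketched as follows. The summary does not say how FinBalance achieves determinism, so the seeded pseudo-random generator, the component pools, and the `generate_record` function are assumptions; the pool contents are invented placeholders, not the benchmark's actual scenarios or inconsistency codes.

```python
import random

# Hypothetical pools standing in for human-authored components.
SCENARIOS = ["inventory purchase on credit", "customer prepayment", "FX-denominated sale"]
POLICIES = ["FIFO", "weighted average"]
INCONSISTENCY_TEMPLATES = ["amount_mismatch", "duplicate_invoice", "missing_receipt"]


def generate_record(seed, difficulty):
    """Compose one benchmark record; the same inputs always give the same record."""
    rng = random.Random(seed)  # local generator, so no global state leaks in
    scenario = rng.choice(SCENARIOS)
    policy = rng.choice(POLICIES)
    # Assumption for illustration: harder records inject more inconsistencies.
    n_injected = min(difficulty, len(INCONSISTENCY_TEMPLATES))
    injected = sorted(rng.sample(INCONSISTENCY_TEMPLATES, k=n_injected))
    return {
        "seed": seed,
        "difficulty": difficulty,
        "scenario": scenario,
        "policy": policy,
        "inconsistency_labels": injected,
    }


record_a = generate_record(seed=7, difficulty=2)
record_b = generate_record(seed=7, difficulty=2)
print(record_a == record_b)  # identical inputs reproduce the identical record
```

Because gold labels are emitted by the same process that builds the documents, every record carries its answer key by construction, and any record can be regenerated from its seed.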

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.