FinBalance: A Multi-Document Accounting Reconciliation Benchmark

2026-06-14 · Source: Computation and Language · Field: Finance & Economics — FinTech & Digital Financial Services, Artificial Intelligence & Machine Learning, Accounting & Financial Reporting · Depth: Advanced, quick

Summary

FinBalance is a multi-document accounting reconciliation benchmark that evaluates large language models (LLMs) on accounting tasks that start from source documents rather than prepared artifacts. It comprises source-document bundles spanning eight industries, three period types, and five difficulty levels. A deterministic generator composes human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates into records with journal entries, balance sheets, and 23 inconsistency-code labels. On a 710-record evaluation split, the best of six contemporary LLMs reached 46% exact final-balance-sheet accuracy. The models' reported balance sheets ("BS_exact") and the balance sheets derived by replaying their own entries ("BS_recon") differed by 26 to 41 percentage points, which points to problems with document binding and consistent aggregation. Citation-pressure prompting had minimal impact on document-linking errors, whereas ledger-feedback ablations substantially improved reported balance sheets and exposed trade-offs in inconsistency detection. Expert finance reviewers validated the benchmark's design and labels.
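The difference between the two balance-sheet metrics can be illustrated with a minimal Python sketch. It assumes that BS_exact compares the model's reported balance sheet with the gold one and that BS_recon compares the sheet obtained by replaying the model's journal entries with the gold one. The record layout, the signed-balance convention, and the names `replay_entries` and `score` are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict
from decimal import Decimal


def replay_entries(entries):
    """Sum journal lines into signed balances (debits positive, credits negative)."""
    balances = defaultdict(Decimal)
    for entry in entries:
        for account, debit, credit in entry["lines"]:
            balances[account] += Decimal(debit) - Decimal(credit)
    # Drop accounts that net to zero so they do not affect the comparison.
    return {account: bal for account, bal in balances.items() if bal != 0}


def score(gold_sheet, reported_sheet, model_entries):
    """Return (bs_exact, bs_recon) as booleans for one record."""
    bs_exact = reported_sheet == gold_sheet
    bs_recon = replay_entries(model_entries) == gold_sheet
    return bs_exact, bs_recon


# Hypothetical record: the entries are right, but the reported sheet is not.
gold = {"Cash": Decimal("1000"), "Share capital": Decimal("-1000")}
entries = [{"lines": [("Cash", "1000", "0"), ("Share capital", "0", "1000")]}]
reported = {"Cash": Decimal("1000"), "Share capital": Decimal("-900")}

print(score(gold, reported, entries))  # (False, True)
```

In this toy record the entries replay to the correct sheet while the reported sheet is wrong, which is one way the two metrics can diverge on the same output.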

Key takeaway

If you are an AI scientist or NLP engineer building LLMs for financial accounting, note that even the best of the six evaluated models reached only 46% exact final-balance-sheet accuracy on multi-document reconciliation. Prioritize robust document-to-entry binding and consistent aggregation logic, the two failure points the benchmark exposes. Consider adding ledger-feedback loops to your training or inference pipelines: in the benchmark's ablations they substantially improved reported balance sheets, though with trade-offs in inconsistency detection.

Key insights

LLMs struggle with multi-document accounting reconciliation: the best of six models reached 46% exact final-balance-sheet accuracy, with failures concentrated in document binding and consistent aggregation.

Principles

Accounting reconciliation requires robust document-to-entry binding.
Consistent aggregation of entries is a major LLM challenge.
Ledger feedback improves reported balance sheets, with trade-offs in inconsistency detection.

Method

FinBalance uses a deterministic generator to compose human-authored scenarios, policies, and inconsistency templates into source-document bundles, producing journal entries, balance sheets, and 23 inconsistency-code labels for LLM evaluation.
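As a rough illustration of what "deterministic" means here, the sketch below composes a record from small hand-written pools using a random number generator seeded only by the record's seed, so the same seed always yields the same documents and labels. The pools, the field names, the `AMOUNT_MISMATCH` code, and `generate_record` are all hypothetical stand-ins; the actual generator, schemas, and 23 codes are not described in this summary.

```python
import random

# Hypothetical, hand-written ingredient pools (not the benchmark's own).
SCENARIOS = ["inventory purchase", "customer prepayment", "equipment lease"]
POLICIES = ["FIFO", "weighted average"]
INCONSISTENCIES = [None, "AMOUNT_MISMATCH"]


def generate_record(seed):
    """Compose one record from the pools using an RNG seeded only by `seed`."""
    rng = random.Random(seed)
    amount = rng.randint(100, 10_000)
    code = rng.choice(INCONSISTENCIES)
    invoice_amount = amount
    if code == "AMOUNT_MISMATCH":
        invoice_amount = amount + 1  # planted discrepancy between documents
    return {
        "scenario": rng.choice(SCENARIOS),
        "policy": rng.choice(POLICIES),
        "documents": [
            {"type": "purchase_order", "amount": amount},
            {"type": "invoice", "amount": invoice_amount},
        ],
        "labels": [code] if code else [],
    }


# Same seed, same record: labels never depend on model output or wall-clock state.
assert generate_record(7) == generate_record(7)
```

Because the labels are produced by the same code path that plants the discrepancy, they are correct by construction, which is the usual motivation for generating benchmark records this way.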

In practice

Evaluate LLMs on multi-document financial tasks.
Develop LLM strategies for document-entry linking.
Implement ledger feedback for accounting LLMs.
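One plausible form of ledger feedback is sketched below, under the assumption that feedback means checking the model's journal entries with ordinary double-entry rules and returning any violations to the model for another attempt. The summary does not specify the feedback format used in the ablations, so `ledger_feedback`, `reconcile_with_feedback`, the `propose` callable, and the entry layout are hypothetical.

```python
from decimal import Decimal


def ledger_feedback(entries):
    """Return one message per journal entry whose debits and credits differ."""
    messages = []
    for index, entry in enumerate(entries):
        debits = sum((Decimal(d) for _, d, _ in entry["lines"]), Decimal(0))
        credits = sum((Decimal(c) for _, _, c in entry["lines"]), Decimal(0))
        if debits != credits:
            messages.append(f"entry {index}: debits {debits} != credits {credits}")
    return messages


def reconcile_with_feedback(propose, documents, max_rounds=3):
    """Call `propose(documents, feedback)` until the ledger check passes."""
    feedback = []
    entries = []
    for _ in range(max_rounds):
        entries = propose(documents, feedback)
        feedback = ledger_feedback(entries)
        if not feedback:
            break
    return entries, feedback


def stub_model(documents, feedback):
    """Stand-in for an LLM call: wrong on the first try, fixed after feedback."""
    credit = "100" if feedback else "90"
    return [{"lines": [("Cash", "100", "0"), ("Sales", "0", credit)]}]


final_entries, remaining = reconcile_with_feedback(stub_model, documents=[])
print(remaining)  # []
```

A check like this only catches arithmetic imbalance. It cannot tell whether an entry was bound to the right source document, which is consistent with the summary's point that binding and aggregation are separate failure modes.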

Topics

FinBalance Benchmark
Accounting Reconciliation
Large Language Models
Financial NLP
Document Understanding
Balance Sheet Accuracy

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.