FinBalance: A Multi-Document Accounting Reconciliation Benchmark
Summary
FinBalance is a new multi-document accounting reconciliation benchmark that evaluates large language models (LLMs) on real-world accounting tasks that go beyond prepared artifacts. The benchmark comprises source-document bundles spanning eight industries, three period types, and five difficulty levels. A deterministic generator composes human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates, and produces journal entries, balance sheets, and 23 inconsistency-code labels. On a 710-record evaluation split, the best of six contemporary LLMs reached 46% exact final-balance-sheet accuracy. A gap of 26-41 percentage points separated the balance sheets the models reported ("BS_exact") from those derived by replaying their own entries ("BS_recon"), which points to problems with document binding and consistent aggregation. Citation-pressure prompting had minimal impact on document-linking errors, whereas ledger-feedback ablations substantially improved reported balance sheets and exposed trade-offs in inconsistency detection. Expert finance reviewers validated the benchmark's design and labels.
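The distinction between "BS_exact" and "BS_recon" can be illustrated with a small sketch. The summary does not give the benchmark's actual scoring code, so everything here is an assumption for illustration: the record layout, the function names `replay_entries` and `sheets_match`, and the convention of storing debit balances as positive and credit balances as negative.

```python
from collections import defaultdict
from decimal import Decimal


def replay_entries(entries):
    """Aggregate journal lines into signed per-account balances.

    Assumed convention (not from the source): debit balances are
    positive, credit balances are negative.
    """
    balances = defaultdict(Decimal)
    for entry in entries:
        lines = entry["lines"]
        debit_total = sum((line["debit"] for line in lines), Decimal(0))
        credit_total = sum((line["credit"] for line in lines), Decimal(0))
        if debit_total != credit_total:
            raise ValueError(f"unbalanced entry {entry['id']}")
        for line in lines:
            balances[line["account"]] += line["debit"] - line["credit"]
    return dict(balances)


def sheets_match(sheet_a, sheet_b):
    """True when both sheets hold the same non-zero accounts and amounts."""
    def non_zero(sheet):
        return {acct: amt for acct, amt in sheet.items() if amt != 0}
    return non_zero(sheet_a) == non_zero(sheet_b)


# Hypothetical model output: the entry is right, the reported sheet is not.
model_entries = [
    {"id": "JE-1", "lines": [
        {"account": "Inventory", "debit": Decimal("500.00"), "credit": Decimal("0")},
        {"account": "Accounts payable", "debit": Decimal("0"), "credit": Decimal("500.00")},
    ]},
]
model_reported_sheet = {
    "Inventory": Decimal("500.00"),
    "Accounts payable": Decimal("-450.00"),
}
reference_sheet = {
    "Inventory": Decimal("500.00"),
    "Accounts payable": Decimal("-500.00"),
}

bs_exact = sheets_match(model_reported_sheet, reference_sheet)
bs_recon = sheets_match(replay_entries(model_entries), reference_sheet)
```

In this toy case the replayed sheet matches the reference while the reported sheet does not, which is the kind of disagreement the BS_exact versus BS_recon gap measures.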
Key takeaway
If you are an AI scientist or NLP engineer developing LLMs for financial accounting, note that the best of the six evaluated models reached only 46% exact final-balance-sheet accuracy on multi-document reconciliation. Prioritize research into robust document-to-entry binding and consistent aggregation logic, since these are the critical failure points. Consider integrating ledger-feedback loops into your training or inference pipelines: in the ablations they substantially improved reported balance sheets, though with trade-offs in inconsistency detection.
Key insights
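LLMs struggle with multi-document accounting reconciliation: the best of six models reached 46% exact final-balance-sheet accuracy, and a 26-41 percentage point gap between reported and replayed balance sheets points to failures in document binding and consistent aggregation.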
Principles
- Accounting reconciliation requires robust document-to-entry binding.
- Consistent aggregation of entries is a major LLM challenge.
- Ledger feedback improves reported balance sheets, with trade-offs in inconsistency detection.
Method
FinBalance uses a deterministic generator to compose human-authored scenarios, policies, and templates, producing journal entries, balance sheets, and 23 inconsistency-code labels for LLM evaluation.
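A deterministic generator of this kind can be sketched with a seeded random number generator, so that the same seed always yields the same record and labels. This is a minimal illustration and not FinBalance's generator: the template pools, the placeholder values in them, the record fields, and the function name `generate_record` are all hypothetical.

```python
import random

# Placeholder pools standing in for human-authored components (hypothetical).
SCENARIOS = ["scenario_a", "scenario_b", "scenario_c"]
POLICIES = ["policy_x", "policy_y"]
INCONSISTENCY_TEMPLATES = ["code_01", "code_02", "code_03"]


def generate_record(seed):
    """Compose one benchmark-style record deterministically from a seed."""
    rng = random.Random(seed)  # private RNG: output depends only on the seed
    scenario = rng.choice(SCENARIOS)
    policy = rng.choice(POLICIES)
    amount = rng.randrange(100, 10_000)  # amount in minor units
    injected = rng.sample(INCONSISTENCY_TEMPLATES, k=rng.randint(0, 2))

    # A single balanced entry: inventory bought on credit.
    entry = {
        "id": f"JE-{seed}",
        "lines": [
            {"account": "Inventory", "debit": amount, "credit": 0},
            {"account": "Accounts payable", "debit": 0, "credit": amount},
        ],
    }
    return {
        "seed": seed,
        "scenario": scenario,
        "policy": policy,
        "entries": [entry],
        # Signed balances: debit positive, credit negative (assumed convention).
        "balance_sheet": {"Inventory": amount, "Accounts payable": -amount},
        "inconsistency_labels": sorted(injected),
    }


record = generate_record(7)
```

Because the ground-truth entries, balance sheet, and labels are produced by the same composition step as the documents, the gold answers stay reproducible without manual annotation of each record.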
In practice
- Evaluate LLMs on multi-document financial tasks.
- Develop LLM strategies for document-entry linking.
- Implement ledger feedback for accounting LLMs.
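One plausible way to implement ledger feedback is to replay the model's own journal entries, compare the result with the balance sheet it reported, and return the discrepancies as messages for a follow-up prompt. The summary does not describe how the paper's ledger-feedback ablations work, so the function `ledger_feedback`, the message wording, and the record layout below are assumptions.

```python
from collections import defaultdict


def ledger_feedback(entries, reported_sheet):
    """Return human-readable discrepancies between entries and a reported sheet.

    Amounts are integers in minor units; debit balances are positive and
    credit balances negative (assumed convention).
    """
    messages = []
    replayed = defaultdict(int)
    for entry in entries:
        debits = sum(line["debit"] for line in entry["lines"])
        credits = sum(line["credit"] for line in entry["lines"])
        if debits != credits:
            messages.append(
                f"Entry {entry['id']} is unbalanced: debits {debits}, credits {credits}."
            )
        for line in entry["lines"]:
            replayed[line["account"]] += line["debit"] - line["credit"]

    # Compare every account that appears on either side.
    for account in sorted(set(replayed) | set(reported_sheet)):
        ledger_amount = replayed.get(account, 0)
        reported_amount = reported_sheet.get(account, 0)
        if ledger_amount != reported_amount:
            messages.append(
                f"{account}: reported {reported_amount}, ledger replay gives {ledger_amount}."
            )
    return messages


entries = [
    {"id": "JE-1", "lines": [
        {"account": "Inventory", "debit": 500, "credit": 0},
        {"account": "Accounts payable", "debit": 0, "credit": 500},
    ]},
]
reported = {"Inventory": 500, "Accounts payable": -450}
feedback = ledger_feedback(entries, reported)
```

An empty list means the reported sheet agrees with the replayed ledger. A non-empty list could be appended to the next prompt so the model can revise its sheet.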
Topics
- FinBalance Benchmark
- Accounting Reconciliation
- Large Language Models
- Financial NLP
- Document Understanding
- Balance Sheet Accuracy
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.