FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification
Summary
FinVerBench is a benchmark and validity study for financial statement verification: the task of deciding whether a corporate financial statement is numerically consistent. It is built from the SEC 10-K XBRL filings of 43 S&P 500 companies and uses a four-category error taxonomy covering arithmetic, cross-statement linkage, year-over-year, and magnitude perturbations. The study evaluated fifteen contemporary LLMs and reports fourteen complete runs (one Gemini 2.5 Pro run was excluded because 40 of 108 gateway calls failed). On an observable diagnostic subset of 105 instances (43 clean, 62 error-injected), nine of the fourteen runs had false-positive rates of 95-100% on clean statements in the unrounded variant. By contrast, one calibrated model achieved 0% observed false positives and 79.0% recall on a realistic rounded variant, compared to 100.0% recall on the unrounded variant. These findings suggest that financial statement verification requires calibrated judgment under incomplete observability and prompt-induced assumptions, rather than arithmetic error detection alone.
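The two headline metrics can be made concrete with a small sketch. The function name `score` and the prediction lists below are illustrative assumptions, not the study's evaluation code; the 43/62 split comes from the summary, and the count of 49 detected errors is simply the integer consistent with 79.0% recall on 62 error-injected instances, not a figure reported in the source.

```python
from dataclasses import dataclass


@dataclass
class VerificationMetrics:
    false_positive_rate: float  # share of clean statements flagged as erroneous
    recall: float               # share of error-injected statements flagged


def score(labels, predictions):
    """Score a verifier. In both lists, True means 'contains an error'."""
    on_clean = [p for has_error, p in zip(labels, predictions) if not has_error]
    on_errors = [p for has_error, p in zip(labels, predictions) if has_error]
    return VerificationMetrics(
        false_positive_rate=sum(on_clean) / len(on_clean),
        recall=sum(on_errors) / len(on_errors),
    )


# 43 clean and 62 error-injected instances, as in the diagnostic subset.
labels = [False] * 43 + [True] * 62

# Hypothetical verifier that flags everything: perfect recall, useless precision.
always_flag = [True] * 105

# Hypothetical calibrated verifier: no clean statement flagged, 49 of 62 errors caught.
calibrated = [False] * 43 + [True] * 49 + [False] * 13

print(score(labels, always_flag))  # false_positive_rate=1.0, recall=1.0
print(score(labels, calibrated))   # false_positive_rate=0.0, recall about 0.790
```

The always-flag verifier shows why recall alone is misleading here: a run with a 95-100% false-positive rate can still report perfect recall.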
Key takeaway
AI scientists and machine learning engineers building financial verification systems should move beyond simple arithmetic checks. Evaluate LLMs on realistically rendered data, such as rounded figures, and account for incomplete observability. Calibrate models to exercise judgment under these conditions, because benchmark rendering choices materially affect measured recall and false-positive rates. Evaluations built this way are more likely to reflect how a system will behave on real-world financial statements.
Key insights
Financial statement verification by LLMs demands calibrated judgment, not just arithmetic, because real-world statements are rounded and only partially observable.
Principles
- Benchmark rendering choices materially affect recall.
- Verification involves judgment under incomplete observability.
- Prompt-induced assumptions influence LLM performance.
Method
FinVerBench uses SEC 10-K XBRL filings to create a 105-instance diagnostic subset with a four-category error taxonomy, evaluating LLMs on unrounded and rounded data variants.
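As a rough illustration of how an error-injected instance and its two renderings might be produced, the sketch below perturbs one line item and renders the statement both unrounded and in millions. The line-item names, dollar values, and helper functions are hypothetical and are not taken from FinVerBench; the point is only that a small arithmetic perturbation can be visible in unrounded figures yet vanish after rounding, which is one plausible reason recall differs between variants.

```python
def inject_arithmetic_error(statement, item, delta):
    """Return a copy of the statement with `delta` added to one line item."""
    perturbed = dict(statement)
    perturbed[item] += delta
    return perturbed


def render(statement, unit=1):
    """Render values in multiples of `unit` (1 = unrounded dollars)."""
    return {name: round(value / unit) for name, value in statement.items()}


def gross_profit_consistent(rendered):
    """Check the identity Revenue - CostOfRevenue == GrossProfit exactly."""
    return rendered["Revenue"] - rendered["CostOfRevenue"] == rendered["GrossProfit"]


# Hypothetical clean statement, in dollars.
clean = {
    "Revenue": 1_234_567_890,
    "CostOfRevenue": 734_567_890,
    "GrossProfit": 500_000_000,
}

# Arithmetic perturbation: overstate gross profit by $300,000.
perturbed = inject_arithmetic_error(clean, "GrossProfit", 300_000)

print(gross_profit_consistent(render(perturbed)))             # False: error visible
print(gross_profit_consistent(render(perturbed, 1_000_000)))  # True: error hidden by rounding
```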
In practice
- Test LLMs on rounded financial data variants.
- Design prompts to account for incomplete observability.
- Calibrate LLM judgment for financial verification tasks.
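One way to see why rounded variants call for calibrated judgment is to compare a strict-equality check with a rounding-aware one. The sketch below is an assumption-laden baseline, not the paper's method: `exact_check`, `tolerant_check`, and the tolerance rule (half a display unit per rounded figure involved) are illustrative choices.

```python
def exact_check(components, total):
    """Strict check: the rounded components must sum exactly to the rounded total."""
    return sum(components) == total


def tolerant_check(components, total, unit=1):
    """Rounding-aware check.

    Each displayed figure may be off by up to half a display unit, so the
    components plus the total can disagree by at most 0.5 * unit * (n + 1)
    without any underlying error.
    """
    tolerance = 0.5 * unit * (len(components) + 1)
    return abs(sum(components) - total) <= tolerance


# Hypothetical clean statement: three items of $1.4M each, total $4.2M.
# Displayed in millions, each item rounds to 1 and the total rounds to 4.
components, total = [1, 1, 1], 4

print(exact_check(components, total))     # False: a clean statement gets flagged
print(tolerant_check(components, total))  # True: the gap is explained by rounding
print(tolerant_check(components, 9))      # False: too large to be rounding
```

The trade-off is the one the summary describes: widening the tolerance removes false positives on clean rounded statements, but any injected error smaller than the tolerance becomes undetectable, so recall on the rounded variant falls.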
Topics
- Financial Statement Verification
- Large Language Models
- FinVerBench
- XBRL Filings
- Benchmark Validity
- Model Calibration
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.