FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Financial Statement Verification · Depth: Expert, quick

Summary

FinVerBench is a benchmark and validity study for financial statement verification, the task of assessing whether corporate financial statements are numerically consistent. It is built from SEC 10-K XBRL filings of 43 S&P 500 companies and uses a four-category error taxonomy covering arithmetic, cross-statement linkage, year-over-year, and magnitude perturbations. The study evaluated fifteen contemporary LLMs and reports fourteen complete runs; one Gemini 2.5 Pro run was excluded because 40 of its 108 gateway calls failed. On an observable diagnostic subset of 105 instances (43 clean, 62 error-injected), nine of the fourteen runs had false-positive rates of 95-100% on clean statements in the unrounded variant. By contrast, one calibrated model produced 0% observed false positives and 79.0% recall on the realistic rounded variant, against 100.0% recall on the unrounded variant. The findings suggest that financial statement verification requires calibrated judgment under incomplete observability and prompt-induced assumptions, not arithmetic detection alone.
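The two headline metrics can be made concrete with a small calculation. The sketch below is illustrative only: the `Outcome` type and both helper functions are hypothetical names, not part of FinVerBench, and the per-instance outcomes are synthetic, chosen so that the totals match the reported subset size (43 clean, 62 error-injected). It assumes the false-positive rate is measured over clean instances and recall over error-injected instances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One model verdict on one statement (hypothetical structure)."""
    has_error: bool  # ground truth: was an error injected?
    flagged: bool    # model verdict: did it report an inconsistency?


def false_positive_rate(outcomes):
    """Share of clean statements that the model flagged as inconsistent."""
    clean = [o for o in outcomes if not o.has_error]
    if not clean:
        raise ValueError("no clean instances to score")
    return sum(o.flagged for o in clean) / len(clean)


def recall(outcomes):
    """Share of error-injected statements that the model flagged."""
    injected = [o for o in outcomes if o.has_error]
    if not injected:
        raise ValueError("no error-injected instances to score")
    return sum(o.flagged for o in injected) / len(injected)


# Synthetic outcomes (an assumption for illustration): no clean statement
# flagged, and 49 of the 62 injected errors detected.
outcomes = (
    [Outcome(has_error=False, flagged=False)] * 43
    + [Outcome(has_error=True, flagged=True)] * 49
    + [Outcome(has_error=True, flagged=False)] * 13
)

fpr = false_positive_rate(outcomes)
rec = recall(outcomes)
print(f"false-positive rate = {fpr:.1%}, recall = {rec:.1%}")
```

With 62 error-injected instances, a recall of 79.0% is what 49 detections would yield (49/62 ≈ 0.790); the summary itself does not state the underlying count.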

Key takeaway

AI scientists and machine learning engineers building financial verification systems need to move beyond simple arithmetic checks. Evaluate LLMs on realistically rendered data, such as rounded figures, and account for incomplete observability. Calibrate models to exercise judgment under these conditions, because benchmark rendering choices significantly affect measured recall and false-positive rates. Evaluations set up this way are more representative of real-world financial statement verification.

Key insights

Financial statement verification by LLMs demands calibrated judgment, not just arithmetic, because real-world statements are rounded and only partially observable.

Method

FinVerBench uses SEC 10-K XBRL filings to create a 105-instance diagnostic subset with a four-category error taxonomy, evaluating LLMs on unrounded and rounded data variants.
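To illustrate why the rounded and unrounded variants can behave differently, the sketch below checks one arithmetic identity (components summing to a total) on exact and rounded figures. It is a minimal example under stated assumptions: the function names, the line-item amounts, and the worst-case tolerance rule are all hypothetical and are not taken from the benchmark, whose actual rendering and scoring rules are not described in this summary.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_to_unit(value, unit):
    """Round an amount to the nearest multiple of `unit` (e.g. one million)."""
    return (value / unit).quantize(Decimal("1"), rounding=ROUND_HALF_UP) * unit


def sum_is_consistent(components, total, unit=Decimal("0")):
    """Check that components sum to total, allowing worst-case rounding slack.

    Each of the n components and the total may each be off by up to unit/2
    after rounding, so the gap can legitimately reach (n + 1) * unit / 2.
    With unit = 0 this reduces to a strict equality check.
    """
    tolerance = (len(components) + 1) * unit / 2
    return abs(sum(components) - total) <= tolerance


MILLION = Decimal("1000000")

# Hypothetical line items in dollars (not drawn from any filing).
raw = [Decimal("1234400000"), Decimal("2345400000"), Decimal("3456400000")]
raw_total = sum(raw)

# Rendered "in millions": each figure is rounded independently.
rounded = [round_to_unit(v, MILLION) for v in raw]
rounded_total = round_to_unit(raw_total, MILLION)

print(sum_is_consistent(raw, raw_total))                        # exact figures
print(sum_is_consistent(rounded, rounded_total))                # strict check
print(sum_is_consistent(rounded, rounded_total, MILLION))       # with tolerance
print(sum_is_consistent(rounded, rounded_total * 10, MILLION))  # magnitude error
```

In this example the rounded components sum to one million less than the rounded total, so a strict equality check would flag a clean statement, while the tolerance-aware check accepts it and still rejects a tenfold magnitude error. The trade-off is that any tolerance wide enough to absorb rounding can also hide small injected errors, which is one plausible reason recall differs between the two variants.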

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.