LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, FinTech & Digital Financial Services · Depth: Expert, quick

Summary

LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval) is a new corpus and benchmark for evaluating large language models on financial reporting. Released on 2026-06-11, it comprises 4,999 digitized corporate annual reports. These are full documents with figures, tables, and narrative text, which takes the benchmark beyond plain-text SEC 10-K filings. Each report is labeled with 31 consolidated financial KPIs and linked to market reactions at earnings dates. The dataset supports three benchmarks: a page-level KPI retrieval task with 118,048 natural language questions, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task over long, numerically dense reports. LEDGER also provides human OCR-quality annotations and a complete toolchain for extraction, validation, and scoring. A case study demonstrates its use by linking CEO-letter rhetoric to post-publication market impact.

Key takeaway

For NLP Engineers and Research Scientists developing long-context LLMs for financial applications, LEDGER provides three complementary tasks (page-level retrieval, single-value lookup, and full KPI extraction) that test models on complete annual reports rather than plain-text 10-K filings. Integrate the retrieval and extraction tasks into your evaluation pipelines to see where your models fail on long, numerically dense corporate reports and whether their answers are grounded in the source document.
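As a rough illustration of plugging the page-level retrieval task into an evaluation pipeline, the sketch below computes mean recall@k over a set of questions. The metric choice, the function names, and the data layout (question id mapped to ranked page numbers and to a set of gold pages) are assumptions made for this example; the summary does not state which metric or file format LEDGER's own scoring toolchain uses.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_pages: List[int], gold_pages: Set[int], k: int) -> float:
    """Fraction of gold pages that appear among the top-k ranked pages."""
    if not gold_pages:
        raise ValueError("gold_pages must not be empty")
    top_k = set(ranked_pages[:k])
    return len(top_k & gold_pages) / len(gold_pages)


def mean_recall_at_k(
    predictions: Dict[str, List[int]],
    gold: Dict[str, Set[int]],
    k: int,
) -> float:
    """Average recall@k over all gold questions.

    A question with no prediction scores 0, so missing answers are penalized.
    """
    if not gold:
        raise ValueError("gold must not be empty")
    scores = [
        recall_at_k(predictions.get(qid, []), pages, k)
        for qid, pages in gold.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: question ids and page numbers are made up.
example_predictions = {"q1": [12, 3, 47], "q2": [5, 9, 1]}
example_gold = {"q1": {3}, "q2": {2}}
example_score = mean_recall_at_k(example_predictions, example_gold, k=2)
```

In this toy case the gold page for q1 is ranked second and the gold page for q2 is never retrieved, so only one of the two questions counts as a hit at k=2.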

Key insights

LEDGER is a benchmark for evaluating long-context LLMs on financial KPI retrieval and extraction from full corporate annual reports.

Principles

Method

LEDGER digitizes 4,999 corporate annual reports, labels each with 31 consolidated financial KPIs, and derives three benchmarks from them: page-level KPI retrieval, conversational single-value lookup, and full KPI extraction. Human OCR-quality annotations and a toolchain for extraction, validation, and scoring support all three.
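To make the full KPI extraction task concrete, here is a minimal sketch of one way to score extracted values against gold labels using a relative numeric tolerance. The tolerance rule, the function names, and the KPI names in the example are hypothetical; the summary does not describe the matching rule or the KPI schema that LEDGER's scoring toolchain applies.

```python
import math
from typing import Dict, Optional


def kpi_match(
    pred: Optional[float], gold: Optional[float], rel_tol: float = 0.01
) -> bool:
    """True if the prediction agrees with the gold value.

    None means "KPI not reported"; it only matches another None.
    Numeric values match when they agree within the relative tolerance.
    """
    if pred is None or gold is None:
        return pred is None and gold is None
    return math.isclose(pred, gold, rel_tol=rel_tol)


def extraction_accuracy(
    pred: Dict[str, Optional[float]],
    gold: Dict[str, Optional[float]],
    rel_tol: float = 0.01,
) -> float:
    """Share of gold KPIs that the prediction matches for one report."""
    if not gold:
        raise ValueError("gold must not be empty")
    hits = sum(
        kpi_match(pred.get(name), value, rel_tol)
        for name, value in gold.items()
    )
    return hits / len(gold)


# Hypothetical KPI names and values, for illustration only.
example_gold = {"revenue": 1000.0, "net_income": 50.0, "dividend": None}
example_pred = {"revenue": 1005.0, "net_income": 60.0}
example_accuracy = extraction_accuracy(example_pred, example_gold)
```

Here the revenue prediction falls within the 1% tolerance, the net income prediction does not, and the unreported dividend is correctly left out, so two of the three KPIs match. A relative tolerance is one plausible design choice for absorbing rounding differences in reported figures; an exact-match or unit-aware rule would be stricter.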

In practice

Topics

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.