The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

2026-06-16 · Source: Artificial Intelligence · Field: Finance & Economics — Corporate Finance & Treasury, FinTech & Digital Financial Services, Capital Markets & Investment Management · Depth: Expert, quick

Summary

The Stanford EDGAR Filings Dataset (SEFD) is an open reconstruction of U.S. SEC filings, designed to address the scarcity of high-quality, long-context training data for large language models (LLMs). It transforms audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings into layout-faithful MultiMarkdown. This corpus is token-efficient, model-ready, and exhibits less than 0.1% overlap with Common Crawl-derived corpora. SEFD-v1, an initial public snapshot, comprises 152 billion tokens, with a larger archive estimated at 550 billion tokens across 18.5 million filings. The dataset supports financial language modeling, evaluation, reasoning, forecasting, compliance, and document understanding. Additionally, SEFD introduces two benchmarks: EDGAR-Forecast for numerical forecasting and EDGAR-OCR for complex financial table transcription.

Key takeaway

For machine learning engineers and AI scientists building LLMs for financial applications, the Stanford EDGAR Filings Dataset offers an open answer to the shortage of long-context financial training data. SEFD-v1's 152 billion tokens can be used to pretrain models on authentic, layout-faithful financial disclosures, with the aim of supporting tasks such as forecasting and compliance. Consider adding the EDGAR-Forecast and EDGAR-OCR benchmarks to your evaluation pipelines to test numerical forecasting and complex financial table transcription.

Key insights

The Stanford EDGAR Filings Dataset provides a new, openly available source of layout-faithful financial data for LLM pretraining and evaluation.

Principles

High-quality long-context data is scarce.
Layout-faithful reconstruction enhances utility.
Open datasets can address proprietary data gaps.

Method

SEC filings are reconstructed into layout-faithful MultiMarkdown, making them token-efficient and model-ready for LLM pretraining and downstream tasks.
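
The token-efficiency idea can be illustrated with a minimal sketch. This is not the SEFD pipeline, whose details are not given here: the helper names `to_markdown_table` and `to_html_table` and the sample figures are hypothetical, and character count is used only as a rough stand-in for token count. The sketch renders the same small financial table as a Markdown pipe table and as HTML, then compares their lengths.

```python
from html import escape


def to_markdown_table(rows):
    """Render rows (first row is the header) as a Markdown pipe table."""
    header, *body = rows

    def fmt(cells):
        # Escape literal pipes so they do not break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


def to_html_table(rows):
    """Render the same rows as a bare HTML table, for comparison."""
    header, *body = rows
    parts = ["<table>"]
    parts.append(
        "<tr>" + "".join(f"<th>{escape(str(c))}</th>" for c in header) + "</tr>"
    )
    for row in body:
        parts.append(
            "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        )
    parts.append("</table>")
    return "\n".join(parts)


# Hypothetical figures, for illustration only.
rows = [
    ["Item", "FY2023", "FY2022"],
    ["Revenue", "1,200", "1,050"],
    ["Net income", "180", "150"],
]

markdown = to_markdown_table(rows)
html = to_html_table(rows)

print(markdown)
print(f"Markdown characters: {len(markdown)}, HTML characters: {len(html)}")
```

Both renderings keep the row and column structure, but the Markdown form carries far less markup per cell, which is the kind of saving a layout-faithful yet compact format is after.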

In practice

Train LLMs on audited financial statements.
Evaluate models on financial forecasting.
Develop tools for compliance and document understanding.

Topics

Stanford EDGAR Filings Dataset
SEC Filings
Large Language Models
Financial Language Modeling
Pretraining Data
Financial Forecasting
Document Understanding

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.